Donor Node Crash when Re-joining After Brief Disconnect on Solaris

Bug #1287567 reported by Jervin R
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Galera
New
Undecided
Unassigned
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Incomplete
Undecided
Unassigned

Bug Description

2014-03-04 03:36:30 92137 [Note] WSREP: Node 0.0 (node2) requested state transfer from 'node1'. Selected 1.0 (node1)(SYNCED) as donor.
2014-03-04 03:36:30 92137 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 365188155)
03:36:30 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=2759
max_threads=5002
thread_count=1912
connection_count=1912
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2002255 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x2a64f1560
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
/mysql/bin/mysqld'my_print_stacktrace+0x1e [0xab285e]
/mysql/bin/mysqld'handle_fatal_signal+0x36c [0x8690fc]
/lib/amd64/libc.so.1'__sighndlr+0x6 [0xfffffd7fff25f846]
/lib/amd64/libc.so.1'call_user_handler+0x1db [0xfffffd7fff25269b]
/lib/amd64/libc.so.1'strlen+0x30 [0xfffffd7fff1c9610] [Signal 11 (SEGV)]
/lib/amd64/libc.so.1'strndup+0x46 [0xfffffd7fff200e46]
/mysql/lib/libgalera_smm.so'_ZN6galera13ReplicatorSMM17process_state_reqEPvPKvmll+0xe9 [0xfffffd7ff7cc6339]
/mysql/lib/libgalera_smm.so'_ZN6galera15GcsActionSource8dispatchEPvRK10gcs_actionRb+0x42c [0xfffffd7ff7c9629c]
/mysql/lib/libgalera_smm.so'_ZN6galera15GcsActionSource7processEPvRb+0x5c [0xfffffd7ff7c96fcc]
/mysql/lib/libgalera_smm.so'_ZN6galera13ReplicatorSMM10async_recvEPv+0x77 [0xfffffd7ff7cb4327]
/mysql/lib/libgalera_smm.so'galera_recv+0x24 [0xfffffd7ff7cc8af4]
/mysql/bin/mysqld'_ZL25wsrep_replication_processP3THD+0x50 [0x7c4420]
/mysql/bin/mysqld'start_wsrep_THD+0x33b [0x7af54b]
/lib/amd64/libc.so.1'_thrp_setup+0x8a [0xfffffd7fff25f46a]
/lib/amd64/libc.so.1'_lwp_start+0x0 [0xfffffd7fff25f780]
Please read http://dev.mysql.com/doc/refman/5.1/en/resolve-stack-dump.html
and follow instructions on how to resolve the stack trace.
Resolved stack trace is much more helpful in diagnosing the
problem, so please do resolve it

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 16
Status: NOT_KILLED

You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
Writing a core file

Revision history for this message
Jervin R (revin) wrote :

How to reproduce (not consistent but at least happened twice), cut off node2 from the cluster via iptables/ipfilter for about 60seconds or so. Then let it reconnect and IST, the crash happens. I've tried to reproduce this on CentOS RPMs to no avail.

mysql> show global variables like '%version%';
+-------------------------+---------------------------------------+
| Variable_name | Value |
+-------------------------+---------------------------------------+
| innodb_version | 5.6.15-60.3 |
| protocol_version | 10 |
| slave_type_conversions | |
| version | 5.6.15-log |
| version_comment | Source distribution, wsrep_25.4.rXXXX |
| version_compile_machine | x86_64 |
| version_compile_os | solaris11 |
+-------------------------+---------------------------------------+
7 rows in set (0.00 sec)

Revision history for this message
Jervin R (revin) wrote :

[client]
port = 3306
socket = /tmp/mysql.sock
default-character-set = utf8mb4

[mysqld]
# Cluster configuration
wsrep_provider=/mysql/lib/libgalera_smm.so
wsrep_cluster_address = "gcomm://192.168.72.8,192.168.72.17,192.168.72.16"
wsrep_node_address=192.168.72.8
wsrep_cluster_name = xxxxxxx_pxc_primary
wsrep_node_name = node1
wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth = sst:xxxxxxxxxxxxxxxxxxx
wsrep_slave_threads = 16
wsrep_auto_increment_control = OFF
wsrep_provider_options='gcache.size=16G; evs.keepalive_period = PT3S; evs.suspect_timeout = PT45S; evs.inactive_timeout = PT1M; evs.install_timeout = PT1M'

# The MySQL server
bind-address=192.168.72.8
core-file
user = mysql
port = 3306
basedir = /mysql
datadir = /databases
socket = /tmp/mysql.sock
default-storage-engine = innodb
skip-external-locking
log_warnings=1
skip_name_resolve
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
group_concat_max_len = 1000000
log-error = /var/log/mysql/error.log

# InnoDB settings
innodb_autoinc_lock_mode=2
innodb_flush_log_at_trx_commit=0
innodb_data_home_dir = /databases
innodb_log_group_home_dir = /databases
innodb_data_file_path = ibdata1:100M:autoextend
innodb_buffer_pool_size = 140G
innodb_buffer_pool_instances = 8
innodb_log_file_size = 2G
innodb_buffer_pool_load_at_startup=ON
innodb_buffer_pool_dump_at_shutdown=ON
# allows for row_format=compressed
innodb_file_format=Barracuda
innodb_log_buffer_size = 64M
innodb_lock_wait_timeout = 50
innodb_file_per_table
innodb_doublewrite = 0
innodb_io_capacity = 600
innodb_read_io_threads = 12
innodb_write_io_threads = 12
innodb_stats_on_metadata = OFF

# Slow query log settings
# The default logs all full table scans,tmp tables,filesorts on disk queries
#use_global_long_query_time = 1
#long_query_time = 0.5
slow_query_log_file = /var/log/mysql/slowquery.log
slow_query_log = 1
long_query_time = 3
log_slow_filter = "full_scan,tmp_table_on_disk,filesort_on_disk"
log_slow_verbosity = "full"

# Other general MySQL settings
sync_binlog = 0
query_cache_type = 0
query_cache_size = 0
thread_cache_size = 1000
back_log = 1024
innodb_thread_concurrency = 64
tmpdir = /var/tmp
max_connections = 5000
max_allowed_packet = 24M
max_join_size = 4294967295
net_buffer_length = 2K
thread_stack = 512K
tmp_table_size = 64M
max_heap_table_size = 64M

[mysqldump]
default-character-set = utf8mb4
quick
max_allowed_packet = 16M

[mysql]
default-character-set = utf8mb4
no-auto-rehash
# Remove the next comment character if you are not familiar with SQL
#safe-updates

[mysqld_safe]

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Looks like it is crashing in process_state_req at

    char* const tmp(strndup(reinterpret_cast<const char*>(streq->sst_req()),
                            streq->sst_len()));

(may be null check required?)

Since I see " Writing a core file" there, is it possible to obtain a 'thread apply all bt' and 'thread apply all bt full' from this? (or from a fresh core)

summary: - Donor Node Crash when Re-joining After Brief Disconnect
+ Donor Node Crash when Re-joining After Brief Disconnect on Solaris
Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Adding galera here since it is crashing in galera.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Also, logs with --wsrep-debug and debug=1 would be good. (for debug=1 make sure to build with latest galera-3).

Revision history for this message
Jervin R (revin) wrote :

Raghu,

There was a core file, however getting a backtrace was problematic - on the crashing thread we can only get this:

Thread 51 (LWP 53 ):
#0 0xfffffd7fff26698a in _lwp_kill () from /lib/64/libc.so.1
No symbol table info available.
#1 0xfffffd7fff25d2ff in thr_kill () from /lib/64/libc.so.1
No symbol table info available.
#2 0x00000000008690c2 in handle_fatal_signal (sig=11) at /var/tmp/tmp.OraAqW/Percona-XtraDB-Cluster-5.6.15/sql/signal_handler.cc:254
curr_time = <optimized out>
thrs = <optimized out>
mins = <optimized out>
hrs_buf = "03"
hrs = <optimized out>
mins_buf = "36"
secs_buf = "30"
tmins = <optimized out>
thd = 0x2a64f1560
#3 0xfffffd7fff25f846 in __sighndlr () from /lib/64/libc.so.1
No symbol table info available.
#4 0xfffffd7fff25269b in call_user_handler () from /lib/64/libc.so.1
No symbol table info available.
#5 0xfffffd7fff252a0e in sigacthandler () from /lib/64/libc.so.1
No symbol table info available.
#6 0xffffffffffffffff in ?? ()
No symbol table info available.
#7 0x000000000000000b in ?? ()
No symbol table info available.
#8 0x0000000000000000 in ?? ()
No symbol table info available.

Obviously the stacktrace from the crash itself is far better. A full backtrace is available on #39786 note #4 and attachment #4.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

At first glance it looks like a bug in strndup() implementation on Solaris. Where does strlen() come from?

If core does not work, it may be helpful to add

log_info << "strndup(" << streq->sst_req() << ", " << streq->sst_len() << ")";

just before that line.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

The full backtraces attached in the issue are also incomplete.

........

57 LWP 59 0xfffffd7fff25f7c7 in __lwp_park () from /lib/64/libc.so.1
56 LWP 58 0xfffffd7fff25f7c7 in __lwp_park () from /lib/64/libc.so.1
55 LWP 57 0xfffffd7fff25f7c7 in __lwp_park () from /lib/64/libc.so.1
54 LWP 56 0xfffffd7fff25f7c7 in __lwp_park () from /lib/64/libc.so.1
53 LWP 55 0xfffffd7fff25f7c7 in __lwp_park () from /lib/64/libc.so.1
52 LWP 54 0xfffffd7fff25f7c7 in __lwp_park () from /lib/64/libc.so.1
51 LWP 53 0xfffffd7fff26698a in _lwp_kill () from /lib/64/libc.so.1
50 LWP 52 0xfffffd7fff25f7c7 in __lwp_park () from /lib/64/libc.so.1
49 LWP 51 0xfffffd7fff25f7c7 in __lwp_park () from /lib/64/libc.so.1
48 LWP 50 0xfffffd7fff25f7c7 in __lwp_park () from /lib/64/libc.so.1
47 LWP 49 0xfffffd7fff25f7c7 in __lwp_park () from /lib/64/libc.so.1
46 LWP 48 0xfffffd7fff25f7c7 in __lwp_park () from /lib/64/libc.so.1
45 LWP 47 0xfffffd7fff25f7c7 in __lwp_park () from /lib/64/libc.so.1
44 LWP 46 0xfffffd7fff25f7c7 in __lwp_park () from /lib/64/libc.so.1
43 LWP 45 0xfffffd7fff265b9a in __sigtimedwait () from /lib/64/libc.so
..............

Do the binaries have any debug symbols or they stripped? The backtraces seem broken.

A better way, may be to run mysqld from gdb

gdb --args mysqld ..... --gdb

Make it crash with test mentioned above and then do the 'bt' 'bt full' and 'thread apply all bt full' .

Using the log_info provided by Alex and rebuilding galera should also help.

Revision history for this message
Jervin R (revin) wrote :

Raghu,

The binaries are not stripped as you can see from the log on the description of this bug report. However when working on the core file, the symbols are incomplete for some reason - you should see the errors when trying GDB on the core file from one of the notes on the issue I referenced.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Yes, I can see the errors but they are not helpful, as I mentioned earlier, using the log_info or running it from gdb may help

Regarding these errors,
""
warning: Couldn't find general-purpose registers in core file.

warning: Couldn't find general-purpose registers in core file.
"on" or "off" expected.
Cannot access memory at address 0x0
""

This could mean gdb/binaries may be broken on that platform, needs to be investigated there (not related to PXC/galera).

Changed in percona-xtradb-cluster:
status: New → Incomplete
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1635

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.