Random crash on the same node

Bug #1327763 reported by dannie on 2014-06-08
48
This bug affects 10 people
Affects Status Importance Assigned to Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Fix Released
High
Krunal Bauskar

Bug Description

This is the third time the same node crashes.

First 2 times were consistent at 2am, I then upgraded from Ubuntu Precise to Trusty to see if that would change anything, and now I'm running with a compiled version of Xtradb Cluster 5.6.17-65.0 on both servers (third is a garbd instance).

Memory does not seem to be an issue http://puu.sh/9kaaQ/c0061882d1.png
Total memory 16GB.
innodb_buffer_pool_size = 8G

2014-06-08 13:03:19 27080 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 10486174)
2014-06-08 13:03:19 27080 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 10486174
2014-06-08 13:03:19 27080 [Warning] WSREP: Failed to apply app buffer: seqno: 10486174, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 2th time
2014-06-08 13:03:19 27080 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 10486174)
2014-06-08 13:03:19 27080 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 10486174
2014-06-08 13:03:19 27080 [Warning] WSREP: Failed to apply app buffer: seqno: 10486174, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 3th time
2014-06-08 13:03:19 27080 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 10486174)
2014-06-08 13:03:19 27080 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 10486174
2014-06-08 13:03:19 27080 [Warning] WSREP: Failed to apply app buffer: seqno: 10486174, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 4th time
2014-06-08 13:03:19 27080 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 10486174)
2014-06-08 13:03:19 27080 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 10486174
2014-06-08 13:03:19 27080 [Warning] WSREP: failed to replay trx: source: 8739bc85-ec9d-11e3-86e7-6b013cd9f468 version: 3 local: 1 state: REPLAYING flags: 1 conn_id: 746343 trx_id: 4385353381 seqnos (l: 2978809, g: 10486174, s: 10486166,
2014-06-08 13:03:19 27080 [Warning] WSREP: Failed to apply trx 10486174 4 times
2014-06-08 13:03:19 27080 [ERROR] WSREP: trx_replay failed for: 6, query: void
2014-06-08 13:03:19 27080 [ERROR] Aborting

2014-06-08 13:03:21 27080 [Note] WSREP: killing local connection: 746345
2014-06-08 13:03:21 27080 [Note] WSREP: killing local connection: 746348
2014-06-08 13:03:21 27080 [Note] WSREP: killing local connection: 746349
2014-06-08 13:03:21 27080 [Note] WSREP: Closing send monitor...
2014-06-08 13:03:21 27080 [Note] WSREP: Closed send monitor.
2014-06-08 13:03:21 27080 [Note] WSREP: gcomm: terminating thread
2014-06-08 13:03:21 27080 [Note] WSREP: gcomm: joining thread
2014-06-08 13:03:21 27080 [Note] WSREP: gcomm: closing backend
2014-06-08 13:03:22 27080 [Note] WSREP: view(view_id(NON_PRIM,04544223-ed52-11e3-b428-eebb72a5cca2,50) memb {
        8739bc85-ec9d-11e3-86e7-6b013cd9f468,0
} joined {
} left {
} partitioned {
        04544223-ed52-11e3-b428-eebb72a5cca2,0
        09f6ccfa-e9c8-11e3-a608-6eeac628a08a,0
})
2014-06-08 13:03:22 27080 [Note] WSREP: view((empty))
2014-06-08 13:03:22 27080 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2014-06-08 13:03:22 27080 [Note] WSREP: gcomm: closed
2014-06-08 13:03:22 27080 [Note] WSREP: Flow-control interval: [16, 16]
2014-06-08 13:03:22 27080 [Note] WSREP: Received NON-PRIMARY.
2014-06-08 13:03:22 27080 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 10486223)
2014-06-08 13:03:22 27080 [Note] WSREP: Received self-leave message.
2014-06-08 13:03:22 27080 [Note] WSREP: Flow-control interval: [0, 0]
2014-06-08 13:03:22 27080 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2014-06-08 13:03:22 27080 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 10486223)
2014-06-08 13:03:22 27080 [Note] WSREP: RECV thread exiting 0: Success
2014-06-08 13:03:22 27080 [Note] WSREP: recv_thread() joined.
2014-06-08 13:03:22 27080 [Note] WSREP: Closing replication queue.
2014-06-08 13:03:22 27080 [Note] WSREP: Closing slave action queue.
2014-06-08 13:03:22 27080 [Note] WSREP: Service disconnected.
2014-06-08 13:03:22 27080 [Note] WSREP: rollbacker thread exiting
2014-06-08 13:03:23 27080 [Note] WSREP: Some threads may fail to exit.
2014-06-08 13:03:23 27080 [Note] Binlog end
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'partition'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'PERFORMANCE_SCHEMA'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_CHANGED_PAGES'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_SYS_DATAFILES'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_SYS_TABLESPACES'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_SYS_FOREIGN_COLS'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_SYS_FOREIGN'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_SYS_FIELDS'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_SYS_COLUMNS'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_SYS_INDEXES'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_SYS_TABLESTATS'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_SYS_TABLES'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_FT_INDEX_TABLE'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_FT_INDEX_CACHE'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_FT_CONFIG'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_FT_BEING_DELETED'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_FT_DELETED'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_FT_DEFAULT_STOPWORD'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_METRICS'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_BUFFER_POOL_STATS'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_BUFFER_PAGE_LRU'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_BUFFER_PAGE'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_CMP_PER_INDEX_RESET'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_CMP_PER_INDEX'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_CMPMEM_RESET'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_CMPMEM'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_CMP_RESET'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_CMP'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_LOCK_WAITS'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_LOCKS'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'INNODB_TRX'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'XTRADB_RSEG'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'XTRADB_INTERNAL_HASH_TABLES'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'XTRADB_READ_VIEW'
2014-06-08 13:03:23 27080 [Note] Shutting down plugin 'InnoDB'
2014-06-08 13:03:23 27080 [Note] InnoDB: FTS optimize thread exiting.
2014-06-08 13:03:23 27080 [Note] InnoDB: Starting shutdown...
11:03:23 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=12582912
read_buffer_size=131072
max_used_connections=68
max_threads=502
thread_count=12
connection_count=12
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 212606 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x10a04b10
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fb778271e20 thread_stack 0x30000
/usr/local/mysql/bin/mysqld(my_print_stacktrace+0x2c)[0x93b21c]
/usr/local/mysql/bin/mysqld(handle_fatal_signal+0x35f)[0x6967bf]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7fb9e97a9340]
/usr/local/mysql/bin/mysqld[0x6b716b]
/usr/local/mysql/bin/mysqld[0x6b8049]
/usr/local/mysql/bin/mysqld[0x6b831e]
/usr/local/mysql/bin/mysqld[0x6a9f59]
/usr/local/mysql/bin/mysqld[0x6a9474]
/usr/local/mysql/bin/mysqld(_Z16acl_authenticateP3THDj+0x1c3)[0x6bd773]
/usr/local/mysql/bin/mysqld[0x6eead1]
/usr/local/mysql/bin/mysqld(_Z16login_connectionP3THD+0x3d)[0x6f008d]
/usr/local/mysql/bin/mysqld(_Z22thd_prepare_connectionP3THD+0x24)[0x6f0864]
/usr/local/mysql/bin/mysqld(_Z24do_handle_one_connectionP3THD+0x137)[0x6f0b97]
/usr/local/mysql/bin/mysqld(handle_one_connection+0x39)[0x6f0de9]
/usr/local/mysql/bin/mysqld(pfs_spawn_thread+0x140)[0xb9abd0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fb9e97a1182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fb9e8cae30d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 746356
Status: NOT_KILLED

You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
140608 13:03:23 mysqld_safe Number of processes running now: 0
140608 13:03:23 mysqld_safe WSREP: not restarting wsrep node automatically
140608 13:03:23 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

dannie (daniz) wrote :

Just to add, when starting the crashed node, it did a SST

140608 22:57:11 mysqld_safe WSREP: Recovered position fc34347d-e515-11e3-ab6a-5f5ace43d8d1:10486173
2014-06-08 22:57:11 0 [Note] WSREP: wsrep_start_position var submitted: 'fc34347d-e515-11e3-ab6a-5f5ace43d8d1:10486173'
2014-06-08 22:57:11 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2014-06-08 22:57:11 7727 [Warning] Using pre 5.5 semantics to load error messages from /usr/local/mysql/share/english/.
2014-06-08 22:57:11 7727 [Warning] If this is not intended, refer to the documentation for valid usage of --lc-messages-dir and --language parameters.
2014-06-08 22:57:11 7727 [Note] WSREP: Read nil XID from storage engines, skipping position init
2014-06-08 22:57:11 7727 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
2014-06-08 22:57:11 7727 [Note] WSREP: wsrep_load(): Galera 3.5(r178) by Codership Oy <email address hidden> loaded successfully.
2014-06-08 22:57:11 7727 [Note] WSREP: CRC-32C: using hardware acceleration.
2014-06-08 22:57:11 7727 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
2014-06-08 22:57:11 7727 [Note] WSREP: Passing config to GCS: base_host = 10.41.172.66; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT
2014-06-08 22:57:11 7727 [Note] WSREP: Service thread queue flushed.

dannie (daniz) wrote :
Download full text (5.4 KiB)

Same thing happened today:

2014-06-09 10:03:41 7727 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 11354082)
2014-06-09 10:03:41 7727 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 11354082
2014-06-09 10:03:41 7727 [Warning] WSREP: Failed to apply app buffer: seqno: 11354082, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 2th time
2014-06-09 10:03:41 7727 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 11354082)
2014-06-09 10:03:41 7727 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 11354082
2014-06-09 10:03:41 7727 [Warning] WSREP: Failed to apply app buffer: seqno: 11354082, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 3th time
2014-06-09 10:03:41 7727 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 11354082)
2014-06-09 10:03:41 7727 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 11354082
2014-06-09 10:03:41 7727 [Warning] WSREP: Failed to apply app buffer: seqno: 11354082, status: 1
         at galera/src/trx_handle.cpp:apply():340
Retrying 4th time
2014-06-09 10:03:41 7727 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 11354082)
2014-06-09 10:03:41 7727 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 11354082
2014-06-09 10:03:41 7727 [Warning] WSREP: failed to replay trx: source: 762da368-ef4f-11e3-935b-f3bb661a8671 version: 3 local: 1 state: REPLAYING flags: 1 conn_id: 118117 trx_id: 4430490127 seqnos (l: 302245, g: 11354082, s: 11354076, d: 11354081, ts: 346572660163671)
2014-06-09 10:03:41 7727 [Warning] WSREP: Failed to apply trx 11354082 4 times
2014-06-09 10:03:41 7727 [ERROR] WSREP: trx_replay failed for: 6, query: void
2014-06-09 10:03:41 7727 [ERROR] Aborting

2014-06-09 10:03:43 7727 [Note] WSREP: killing local connection: 118122
2014-06-09 10:03:43 7727 [Note] WSREP: killing local connection: 118120
2014-06-09 10:03:43 7727 [Note] WSREP: Closing send monitor...
2014-06-09 10:03:43 7727 [Note] WSREP: Closed send monitor.
2014-06-09 10:03:43 7727 [Note] WSREP: gcomm: terminating thread
2014-06-09 10:03:43 7727 [Note] WSREP: gcomm: joining thread
2014-06-09 10:03:43 7727 [Note] WSREP: gcomm: closing backend
2014-06-09 10:03:44 7727 [Note] WSREP: view(view_id(NON_PRIM,04544223-ed52-11e3-b428-eebb72a5cca2,52) memb {
        762da368-ef4f-11e3-935b-f3bb661a8671,0
} joined {
} left {
} partitioned {
        04544223-ed52-11e3-b428-eebb72a5cca2,0
        09f6ccfa-e9c8-11e3-a608-6eeac628a08a,0
})
2014-06-09 10:03:44 7727 [Note] WSREP: view((empty))
2014-06-09 10:03:44 7727 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2014-06-09 10:03:44 7727 [Note] WSREP: gcomm: closed
2014-06-09 10:03:44 7727 [Note] WSREP: Flow-control interval: [16, 16]
2014-06-09 10:03:44 7727 [Note] WSREP: Received NON-PRIMARY.
2014-06-09 10:03:44 7727 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 11354114)
2014-06-09 10:03:44 7727 [...

Read more...

@dannie,

Can you resolve the stack trace with debuginfo installed? It would be good to
get a bt, bt full and thread apply all bt full if you can obtain a core dump.

dannie (daniz) wrote :

Hi,

I can try if you give me some more details how I would go about doing this.

Also want to mention that it has not happened the last couple of days for some reason. Could be cause I killed percona-agent, but it's very hard to say. If you let me know how I can add more debug info, I'll set it up to try to capture it next time it happens.

dannie (daniz) wrote :

Same again,

hread pointer: 0xf8ba300
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f681f7e5e20 thread_stack 0x30000
/usr/local/mysql/bin/mysqld(my_print_stacktrace+0x2c)[0x93b21c]
/usr/local/mysql/bin/mysqld(handle_fatal_signal+0x35f)[0x6967bf]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f6a93340340]
/usr/local/mysql/bin/mysqld[0x6b716b]
/usr/local/mysql/bin/mysqld[0x6b8049]
/usr/local/mysql/bin/mysqld[0x6b831e]
/usr/local/mysql/bin/mysqld[0x6a9f59]
/usr/local/mysql/bin/mysqld[0x6a9474]
/usr/local/mysql/bin/mysqld(_Z16acl_authenticateP3THDj+0x1c3)[0x6bd773]
/usr/local/mysql/bin/mysqld[0x6eead1]
/usr/local/mysql/bin/mysqld(_Z16login_connectionP3THD+0x3d)[0x6f008d]
/usr/local/mysql/bin/mysqld(_Z22thd_prepare_connectionP3THD+0x24)[0x6f0864]
/usr/local/mysql/bin/mysqld(_Z24do_handle_one_connectionP3THD+0x137)[0x6f0b97]
/usr/local/mysql/bin/mysqld(handle_one_connection+0x39)[0x6f0de9]
/usr/local/mysql/bin/mysqld(pfs_spawn_thread+0x140)[0xb9abd0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7f6a93338182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f6a9284530d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 1760219
Status: NOT_KILLED

more info on the debuginfo would be great.

@dannie,

Do you have an empty users table? Please check https://bugs.launchpad.net/percona-server/+bug/1259833 (last few comments). I will merge this bug into that if so.

dannie (daniz) wrote :

Hi

sadly no

ls -lh mysql/user*
-rw-rw---- 1 mysql mysql 11K Jun 19 17:33 mysql/user.frm
-rw-rw---- 1 mysql mysql 1.2K Jun 19 17:33 mysql/user.MYD
-rw-rw---- 1 mysql mysql 2.0K Jun 19 17:33 mysql/user.MYI

dannie (daniz) wrote :

Should I start mysql in gdb, or compile it with some debug mode first?

Bert Hutzler (bhutzler) wrote :

We have similar issues, except that it were different nodes that crashed randomly.
Just updatet this morning to Percona-XtraDB-Cluster-full-56.x86_64 1:5.6.21-25.8.938.el6 and hoped so to get rid of that crashes.
But about 2 hours after the update it happened again.
I attached the relevant log lines.
I'd happily provide you with more information as you may request.
This is a 3 node cluster on CentOS 6.5.

@Bert, @dannie,

Can you enable coredumps for mysql and provide backtrace from that. You can use https://gist.github.com/ronin13/8d403c2f88826fcc7aba as the gdb script for it.

Also, general log (decoded binlog if you have it), my.cnf and any steps to reproduce this would be great.

Also, do you run any prepared statements? http://bugs.mysql.com/bug.php?id=42041 has some details and workarounds for 1615 (ER_NEED_REPREPARE) - such as increasing table_definition_cache.

You can use http://www.percona.com/blog/2011/08/26/getting-mysql-core-file-on-linux/ to get coredump.

Make sure to install debuginfo package for this.

Also, provide complete error log if possible.

Bert Hutzler (bhutzler) wrote :

Uhh, this was not quit trivial for me to figure out. To help followers, I recommend the following site to learn how to get a core dump on RedHat aka CentOS: http://www.unixmen.com/how-to-enable-core-dumps-in-rhel6/
This worked for me.
I applied this on a test machine, and after also figuring out how to use that gdb script, I got the attachement.

Before sending you the whole thing from a crashing production machine, I'd like to ask, if this is, what you expect or if I have to use another mysqld binary because of the messages of the missing symbols. Meanwhile here's only the gdb.txt

debuginfo package for XtraDB Cluster is installed. Do I have to do sth. to kind of activate this or so?

Bert Hutzler (bhutzler) wrote :

It's mysterious. Now we run a 5 node cluster. The first one crashed yesterday 7 times, while all the others run approximately 4 days without interruption...
I'd like to provide you guys with more debug info. So I'm waiting for feedback on my last post...

dannie (daniz) wrote :

I'm back with some new info. After the last update to 5.6.22-72.0-56 every node crashes within 3-4h, and at random times. After some help from the percona crew to get the debug info this is what I got, hope it helps.

Let me know if there is anything else I can post

Thank you for your answer, and sorry for being so late :(
but we're very occupied... ...maybe you're also familiar with this
situation: more work, less workers...

Well, I think, we already could produce the right output, and also being
in contact with percona, they provided a bugfix, which seems to work.
All in all i think it's a combination of several factors, that we could
reduce the crashes to a very minimum of about less than once a week on
one of the nodes.

What we did is:
1. put the database on SSDs
2. a dedicated switch for synchronization of the nodes
3. doing writes mainly on only one node

I will put this info also on launchpad,
so maybe some other people can benefit from...

Mit freundlichen Grüßen (kind regards)

Bert Hutzler

Quoka GmbH
EDV Abteilung
Chemiestraße 14-15
68623 Lampertheim

Telefon: +49 (0) 6206 / 931-128
Telefax: +49 (0) 6206 / 931-149
E-Mail: mailto:<email address hidden>
Internet: http://www.quoka.de

AG Darmstadt - Reg. Abt. Lampertheim - HRB 61482 - USt-ID-Nr. DE 175 723 521 - Geschäftsführung: Patrick Günter

On 22.03.2015 22:14, dannie wrote:
> I'm back with some new info. After the last update to 5.6.22-72.0-56
> every node crashes within 3-4h, and at random times. After some help
> from the percona crew to get the debug info this is what I got, hope it
> helps.
>
> Let me know if there is anything else I can post
>
>
> ** Attachment added: "crash.log"
> https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1327763/+attachment/4352525/+files/crash.log
>

Davestj (davestj) wrote :
Download full text (15.9 KiB)

I am also experiencing this issue, would be interesting to find the cause or if this can be accepted and worked.

Cluster details

Percona XtraDB galera cluster x3 nodes each node 30gb RAM

Our rpm list
| <email address hidden>:/rootdba -> rpm -qa | grep Percona
Percona-Server-shared-51-5.1.73-rel14.12.624.rhel6.x86_64
Percona-XtraDB-Cluster-client-56-5.6.24-25.11.1.el6.x86_64
Percona-XtraDB-Cluster-galera-3-3.11-1.rhel6.x86_64
Percona-XtraDB-Cluster-server-56-5.6.24-25.11.1.el6.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.24-25.11.1.el6.x86_64

OS Version: CentOS release 6.6 (Final)

Crash appears to be random but consistent, needs manual start to recover.
So far the whole cluster has not entirely crashed, the other 2 nodes have continued and allows rejoin after the crashed node is manually started, sometimes the first start attempt fails and a 2nd "service mysql start" command is needed, which then detects the stale sst file and re-initializes things.

Mysql error Log snapshot
----------------
2015-06-18 18:26:13 18275 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 1704468)
2015-06-18 18:26:13 18275 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 1704468
2015-06-18 18:26:13 18275 [Warning] WSREP: Failed to apply app buffer: seqno: 1704468, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 2th time
2015-06-18 18:26:13 18275 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 1704468)
2015-06-18 18:26:13 18275 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 1704468
2015-06-18 18:26:13 18275 [Warning] WSREP: Failed to apply app buffer: seqno: 1704468, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 3th time
2015-06-18 18:26:13 18275 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 1704468)
2015-06-18 18:26:13 18275 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 1704468
2015-06-18 18:26:13 18275 [Warning] WSREP: Failed to apply app buffer: seqno: 1704468, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 4th time
2015-06-18 18:26:13 18275 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1615, fatal: 0 wsrep = (exec_mode: 1 conflict_state: 5 seqno: 1704468)
2015-06-18 18:26:13 18275 [Warning] WSREP: RBR event 3 Update_rows apply warning: 1615, 1704468
2015-06-18 18:26:13 18275 [Warning] WSREP: failed to replay trx: source: 401397a7-15b4-11e5-8b99-87d715e86d59 version: 3 local: 1 state: REPLAYING flags: 1 con
n_id: 11318 trx_id: 16121482 seqnos (l: 18306, g: 1704468, s: 1704466, d: 1704467, ts: 3186938643214857)
2015-06-18 18:26:13 18275 [Warning] WSREP: Failed to apply trx 1704468 4 times
2015-06-18 18:26:13 18275 [ERROR] WSREP: trx_replay failed for: 6, query: void
2015-06-18 18:26:13 18275 [ERROR] Aborting

2015-06-18 18:26:15 18275 [Note] WSREP: Closing send monitor...
2015-06-18 18:26:15 18275 [Note] WSREP: Closed send monitor.
2015-06-18 18:26:15 18275 [Note] WSREP: gcomm: terminating thread
2015-06-18 18:26:15 18275 [Note] WSREP: gcomm: joining th...

dannie (daniz) wrote :

I still have an occasional random shutdown, no one has commented on my debug-crash-log, not sure the bug is still seen as active or important.. Random crashes right, who cares :D

In all seriousness, random crashes can be hard to debug, but a comment what we can do to help more, or a notice of my log dump would be great.

Bert Hutzler (bhutzler) wrote :

As i already stated, we run a 5 node cluster with databases on SSDs, dedicated switch for sync and doing writes mainly on only one node.
Actually we're running Percona-XtraDBCluster 5.6.22 25.8.978.el6,
and still some nodes crash randomly but rather rarely (about once or twice a week)
I handled this with a script running every minute and restarting mysql if it's crashed...

We're productive and this runs ok for us, although we'd be glad of course getting totally rid of the random crashes...
And we're planning to update this week to the actual release 5.6.24-25.11.1.el6 this week.

I'll report changes, hopefully improvements after updating...

commit d4fa27b020df425a6421ae12a18de9306cf5c90d
Merge: 9d30f12 f06d17f
Author: Krunal Bauskar <email address hidden>
Date: Thu Feb 18 06:35:17 2016 +0530

    Merge pull request #90 from kbauskar/5.6-pxc-497

    - PXC#497: Warning: "BF applier failed to open_and_lock_tables"

commit f06d17fd75c2aee6dbdbecd81e0795be1017284a
Author: Krunal Bauskar <email address hidden>
Date: Mon Feb 15 22:15:47 2016 +0530

    - PXC#497: Warning: "BF applier failed to open_and_lock_tables"

      Issue:
      -----

      Let me explain it with an example. Would be simpler.

      Let's assume there are 2 pxc nodes and each node and following
      workload is being

      node-1: executes already prepare stmt and about to commit
      (update i > 2 and i < 5)

      node-2: executes some stmt that would conflict with node-1 (insert (3))
      inserting in gap.

      As per the protocol local trx is selected for rollback and so
      when node-1 get node-2 trx which is applier trx node-1 local trx that
      is prepare stmt execution is rolled back and re-tried using replay trx.

      Note: retried using replay trx effectively means only write-set
      is applied this makes it bit faster.
      Application of write-set uses the apply-log-event path which
      open and locks table so there is no need to keep track of
      table versioning change which is normally done (in normal path)
      by installing a reprepare observer.

      Solution:
      ---------

      * Avoid installing observer when prepare trx is being replayed
        as trx will open and lock the tables.

Changed in percona-xtradb-cluster:
assignee: nobody → Krunal Bauskar (krunal-bauskar)
importance: Undecided → High
status: New → Fix Committed
Przemek (pmalkowski) on 2016-03-07
tags: added: i66365
Changed in percona-xtradb-cluster:
milestone: none → 5.6.29-25.15
Changed in percona-xtradb-cluster:
status: Fix Committed → Fix Released

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1007

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.