Cluster failed because of "Writeset deserialization failed: Writeset checksum failed: 22 (Invalid argument)" error

Bug #1498098 reported by Manel Martinez
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Expired
Undecided
Unassigned

Bug Description

A few days ago the entire cluster, composed of three nodes in the same LAN, crashed (all nodes at the same time). The error on node db1 was this:

2015-09-16 14:08:25 11591 [ERROR] WSREP: RecordSet checksum does not match:
computed: 34072646 a1931ae0 ccb8a464 934ec6ae
found: f9f0f9c1 998b49f6 88a0fbf2 07691543: 22 (Invalid argument)
         at galerautils/src/gu_rset.cpp:checksum():392
2015-09-16 14:08:25 11591 [ERROR] WSREP: Writeset deserialization failed: Writeset checksum failed: 22 (Invalid arg
ument)
         at galera/src/write_set_ng.hpp:checksum_fin():813
         at galera/src/trx_handle.cpp:unserialize():268
WS flags: 1
Trx proto: 3
Trx source: 2bd1b8cd-47ef-11e5-a82a-06e2536d1c1c
Trx conn_id: 2898024
Trx trx_id: 957996414
Trx last_seen: 460969023
2015-09-16 14:08:25 11591 [ERROR] WSREP: Writeset checksum failed: 22 (Invalid argument)
         at galera/src/write_set_ng.hpp:checksum_fin():813
         at galera/src/trx_handle.cpp:unserialize():268
2015-09-16 14:08:25 11591 [ERROR] WSREP: unknown connection failure
2015-09-16 14:08:25 11591 [ERROR] WSREP: FSM: no such a transition REPLICATING -> ROLLED_BACK
14:08:25 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster
key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=21
max_threads=153
thread_count=12
connection_count=10
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 77444 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x73df000
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f8439cece50 thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x2c)[0x92a78c]
/usr/sbin/mysqld(handle_fatal_signal+0x352)[0x67ed42]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f851cd86340]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39)[0x7f851c1c7cc9]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f851c1cb0d8]
/usr/lib/libgalera_smm.so(_ZN6galera3FSMINS_9TrxHandle5StateENS1_10TransitionENS_10EmptyGuardENS_11EmptyActionEE8shift_toES2_+0x17c)[0x7f8519ee755c]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM13post_rollbackEPNS_9TrxHandleE+0x26)[0x7f8519edda36]
/usr/lib/libgalera_smm.so(galera_post_rollback+0x48)[0x7f8519eeea88]
/usr/sbin/mysqld[0x7bf00d]
/usr/sbin/mysqld[0x7bf1b8]
/usr/sbin/mysqld(_Z15ha_rollback_lowP3THDb+0x8e)[0x5b0eee]
/usr/sbin/mysqld(_ZN13MYSQL_BIN_LOG8rollbackEP3THDb+0x9a)[0x8db3aa]
/usr/sbin/mysqld(_Z17ha_rollback_transP3THDb+0x48)[0x5b1008]
/usr/sbin/mysqld(_Z15ha_commit_transP3THDbb+0x21c)[0x5b150c]
/usr/sbin/mysqld(_Z17trans_commit_stmtP3THD+0x37)[0x7a3327]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x4d8)[0x70c788]
/usr/sbin/mysqld[0x7141f8]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x130f)[0x715acf]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x1f2)[0x717ae2]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x2ed)[0x6da4dd]
/usr/sbin/mysqld(handle_one_connection+0x39)[0x6da579]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7f851cd7e182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f851c28b47d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f84fd94c920): is an invalid pointer
Connection ID (thread ID): 2898024
Status: NOT_KILLED

You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
150916 14:08:27 mysqld_safe Number of processes running now: 0
150916 14:08:27 mysqld_safe WSREP: not restarting wsrep node automatically
150916 14:08:27 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

These are the logs on db2 server:

2015-09-16 14:08:25 22475 [ERROR] WSREP: RecordSet checksum does not match:
computed: 34072646 a1931ae0 ccb8a464 934ec6ae
found: f9f0f9c1 998b49f6 88a0fbf2 07691543: 22 (Invalid argument)
         at galerautils/src/gu_rset.cpp:checksum():392
2015-09-16 14:08:25 22475 [ERROR] WSREP: Writeset deserialization failed: Writeset checksum failed: 22 (Invalid argument)
         at galera/src/write_set_ng.hpp:checksum_fin():813
         at galera/src/trx_handle.cpp:unserialize():268
WS flags: 0
Trx proto: 3
Trx source: 00000000-0000-0000-0000-000000000000
Trx conn_id: 18446744073709551615
Trx trx_id: 18446744073709551615
Trx last_seen: -1
2015-09-16 14:08:25 22475 [ERROR] WSREP: Writeset checksum failed: 22 (Invalid argument)
         at galera/src/write_set_ng.hpp:checksum_fin():813
         at galera/src/trx_handle.cpp:unserialize():268
2015-09-16 14:08:25 22475 [Note] WSREP: applier thread exiting (code:7)
2015-09-16 14:08:25 22475 [ERROR] WSREP: node consistency compromised, aborting
2015-09-16 14:08:25 22475 [Note] WSREP: starting shutdown
2015-09-16 14:08:25 22475 [Note] /usr/sbin/mysqld: Normal shutdown

2015-09-16 14:08:25 22475 [Note] WSREP: Stop replication
2015-09-16 14:08:25 22475 [Note] WSREP: Closing send monitor...
2015-09-16 14:08:25 22475 [Note] WSREP: Closed send monitor.
2015-09-16 14:08:25 22475 [Note] WSREP: gcomm: terminating thread
2015-09-16 14:08:25 22475 [Note] WSREP: gcomm: joining thread
2015-09-16 14:08:25 22475 [Note] WSREP: gcomm: closing backend
2015-09-16 14:08:25 22475 [Note] WSREP: declaring 2bd1b8cd at tcp://10.26.0.50:4567 stable
2015-09-16 14:08:25 22475 [Note] WSREP: forgetting 889a5742 (tcp://10.26.0.52:4567)

And these are db3 logs:

2015-09-16 14:08:25 18460 [ERROR] WSREP: RecordSet checksum does not match:
computed: 34072646 a1931ae0 ccb8a464 934ec6ae
found: f9f0f9c1 998b49f6 88a0fbf2 07691543: 22 (Invalid argument)
         at galerautils/src/gu_rset.cpp:checksum():392
2015-09-16 14:08:25 18460 [ERROR] WSREP: Writeset deserialization failed: Writeset checksum failed: 22 (Invalid argument)
         at galera/src/write_set_ng.hpp:checksum_fin():813
         at galera/src/trx_handle.cpp:unserialize():268
WS flags: 0
Trx proto: 3
Trx source: 00000000-0000-0000-0000-000000000000
Trx conn_id: 18446744073709551615
Trx trx_id: 18446744073709551615
Trx last_seen: -1
2015-09-16 14:08:25 18460 [ERROR] WSREP: Writeset checksum failed: 22 (Invalid argument)
         at galera/src/write_set_ng.hpp:checksum_fin():813
         at galera/src/trx_handle.cpp:unserialize():268
2015-09-16 14:08:25 18460 [Note] WSREP: applier thread exiting (code:7)
2015-09-16 14:08:25 18460 [ERROR] WSREP: node consistency compromised, aborting
2015-09-16 14:08:25 18460 [Note] WSREP: starting shutdown
2015-09-16 14:08:25 18460 [Note] /usr/sbin/mysqld: Normal shutdown

2015-09-16 14:08:25 18460 [Note] WSREP: Stop replication
2015-09-16 14:08:25 18460 [Note] WSREP: Closing send monitor...
2015-09-16 14:08:25 18460 [Note] WSREP: Closed send monitor.
2015-09-16 14:08:25 18460 [Note] WSREP: gcomm: terminating thread
2015-09-16 14:08:25 18460 [Note] WSREP: gcomm: joining thread
2015-09-16 14:08:25 18460 [Note] WSREP: gcomm: closing backend

This is the version on all three nodes:

# /usr/sbin/mysqld --version
/usr/sbin/mysqld Ver 5.6.22-72.0-56 for debian-linux-gnu on x86_64 (Percona XtraDB Cluster (GPL), Release rel72.0, Revision 978, WSREP version 25.8, wsrep_25.8.r4150)

What could cause this problem?

Manel Martinez (zirus86)
description: updated
Revision history for this message
Jervin R (revin) wrote :

This is also seen on 5.6.25 PXC

2015-10-02 20:21:40 19051 [Note] WSREP: Receiving IST: 348517 writesets, seqnos 5313805227-5314153744
2015-10-02 20:21:40 19051 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.25-73.1-56' socket: '/var/run/mysqld/mysqld.sock' port: 3306 Percona XtraDB Cluster (GPL), Release rel73.1, Revision 011f1e6, WSREP version 25.12, wsrep_25.12
2015-10-02 20:21:40 19051 [ERROR] WSREP: RecordSet checksum does not match:
computed: 9beed926 44854eec 57a6b8ea b0d7b3b9
found: 2ef0c835 d1d6c397 324964df d0879aa1: 22 (Invalid argument)
   at galerautils/src/gu_rset.cpp:checksum():392
2015-10-02 20:21:40 19051 [ERROR] WSREP: Writeset deserialization failed: Writeset checksum failed: 22 (Invalid argument)
   at galera/src/write_set_ng.hpp:checksum_fin():813
   at galera/src/trx_handle.cpp:unserialize():268
WS flags: 0
Trx proto: 3
Trx source: 00000000-0000-0000-0000-000000000000
Trx conn_id: 18446744073709551615
Trx trx_id: 18446744073709551615
Trx last_seen: -1
2015-10-02 20:21:40 19051 [ERROR] WSREP: got exception while reading ist stream: Writeset checksum failed: 22 (Invalid argument)
   at galera/src/write_set_ng.hpp:checksum_fin():813
   at galera/src/trx_handle.cpp:unserialize():268
2015-10-02 20:21:40 19051 [ERROR] WSREP: IST didn't contain all write sets, expected last: 5314153744 last received: 5313805242
2015-10-02 20:21:40 19051 [ERROR] WSREP: receiving IST failed, node restart required: IST receiver reported error: 71 (Protocol error)
   at galera/src/ist.cpp:recv():432
2015-10-02 20:21:40 19051 [Note] WSREP: Closing send monitor...
2015-10-02 20:21:40 19051 [Note] WSREP: Closed send monitor.
2015-10-02 20:21:40 19051 [Note] WSREP: gcomm: terminating thread
2015-10-02 20:21:40 19051 [Note] WSREP: gcomm: joining thread
2015-10-02 20:21:40 19051 [Note] WSREP: gcomm: closing backend
2015-10-02 20:21:40 19051 [Note] WSREP: view(view_id(NON_PRIM,b5e85121,34) memb {
  d0f8b70f,0
} joined {
} left {
} partitioned {
  b5e85121,0
  fba34727,0
})
2015-10-02 20:21:40 19051 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2015-10-02 20:21:40 19051 [Note] WSREP: view((empty))
2015-10-02 20:21:40 19051 [Note] WSREP: gcomm: closed
2015-10-02 20:21:40 19051 [Note] WSREP: Flow-control interval: [1800, 1800]
2015-10-02 20:21:40 19051 [Note] WSREP: Received NON-PRIMARY.
2015-10-02 20:21:40 19051 [Note] WSREP: Shifting JOINER -> OPEN (TO: 5314176073)
2015-10-02 20:21:40 19051 [Note] WSREP: Received self-leave message.
2015-10-02 20:21:40 19051 [Note] WSREP: Flow-control interval: [1800, 1800]
2015-10-02 20:21:40 19051 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2015-10-02 20:21:40 19051 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 5314176073)
2015-10-02 20:21:40 19051 [Note] WSREP: RECV thread exiting 0: Success
2015-10-02 20:21:40 19051 [Note] WSREP: recv_thread() joined.
2015-10-02 20:21:40 19051 [Note] WSREP: Closing replication queue.
2015-10-02 20:21:40 19051 [Note] WSREP: Closing slave action queue.
2015-10-02 20:21:40 19051 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Aborted (core dumped)

Revision history for this message
Krunal Bauskar (krunal-bauskar) wrote :

I see there are 2 problems

Problem-1: Write serialization failure (Upfront not sure what is reason behind it)
Problem-2: Crash while rolling back such trx which is also reported as part of this bug "https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1292842" and we recently fixed it. Problem-2 should be part of our next release.

To further investigate Problem-1 may be further information will help

1. Complete error.log
2. my.cnf settings
3. What use-case was carried. If you have a reproducible use-case that you can share then that's best.

Changed in percona-xtradb-cluster:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Percona XtraDB Cluster because there has been no activity for 60 days.]

Changed in percona-xtradb-cluster:
status: Incomplete → Expired
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1854

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.