FSM no such a transition ROLLED_BACK, mysqld got signal 6

Bug #1261688 reported by Chriss
This bug affects 2 people
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC); status tracked in 5.6
5.6 series: Status: Fix Released, Importance: Critical, Assigned to: Krunal Bauskar

Bug Description

CentOS release 6.5 (Final)
kernel 2.6.32-431.el6.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.14-25.1.571.rhel6.x86_64
Percona-XtraDB-Cluster-test-56-5.6.14-25.1.571.rhel6.x86_64
percona-xtrabackup-2.1.6-702.rhel6.x86_64
Percona-XtraDB-Cluster-galera-56-3.1-1.169.rhel6.x86_64
Percona-XtraDB-Cluster-server-56-5.6.14-25.1.571.rhel6.x86_64
percona-release-0.0-1.x86_64
percona-toolkit-2.2.5-2.noarch
Percona-XtraDB-Cluster-client-56-5.6.14-25.1.571.rhel6.x86_64

2013-12-16 16:10:06 14575 [Note] WSREP: 0.0 (ISP01-Node03-web01): State transfer from 1.0 (ISP01-Node01) complete.
2013-12-16 16:10:06 14575 [Note] WSREP: Member 0 (ISP01-Node03-web01) synced with group.
2013-12-17 10:30:11 14575 [Warning] WSREP: SQL statement was ineffective, THD: 56, buf: 272
QUERY: UPDATE sys_session SET last_updated = '2013-12-17 10:30:11' WHERE session_id = '3u0g48olu25cev7rk8f0k58nd5'
 => Skipping replication
2013-12-17 10:30:11 14575 [Warning] WSREP: SQL statement was ineffective, THD: 54, buf: 272
QUERY: UPDATE sys_session SET last_updated = '2013-12-17 10:30:11' WHERE session_id = '3u0g48olu25cev7rk8f0k58nd5'
 => Skipping replication
2013-12-17 10:30:11 14575 [ERROR] WSREP: FSM: no such a transition ROLLED_BACK -> ROLLED_BACK
09:30:11 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=268435456
read_buffer_size=268435456
max_used_connections=25
max_threads=252
thread_count=25
connection_count=25
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 198446759 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x3f90be0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f161520dd38 thread_stack 0x80000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x8ff355]
/usr/sbin/mysqld(handle_fatal_signal+0x4c4)[0x67dd34]
/lib64/libpthread.so.0[0x34df40f710]
/lib64/libc.so.6(gsignal+0x35)[0x34df032925]
/lib64/libc.so.6(abort+0x175)[0x34df034105]
/usr/lib64/libgalera_smm.so(_ZN6galera3FSMINS_9TrxHandle5StateENS1_10TransitionENS_10EmptyGuardENS_11EmptyActionEE8shift_toES2_+0x2da)[0x7f168d12ea7a]
/usr/lib64/libgalera_smm.so(_ZN6galera13ReplicatorSMM13post_rollbackEPNS_9TrxHandleE+0x2e)[0x7f168d14970e]
/usr/lib64/libgalera_smm.so(galera_post_rollback+0x61)[0x7f168d165361]
/usr/sbin/mysqld[0x7b6dbe]
/usr/sbin/mysqld(_Z15ha_rollback_lowP3THDb+0x97)[0x5c3ea7]
/usr/sbin/mysqld(_ZN13MYSQL_BIN_LOG8rollbackEP3THDb+0x54)[0x8b93c4]
/usr/sbin/mysqld(_Z17ha_rollback_transP3THDb+0x74)[0x5c3c74]
/usr/sbin/mysqld(_Z15ha_commit_transP3THDbb+0x29e)[0x5c446e]
/usr/sbin/mysqld(_Z17trans_commit_stmtP3THD+0x35)[0x79cff5]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x90c)[0x6ff71c]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x608)[0x7050c8]
/usr/sbin/mysqld[0x7051f1]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x1ad4)[0x707474]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x1e3)[0x708843]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x17f)[0x6d20ef]
/usr/sbin/mysqld(handle_one_connection+0x47)[0x6d22c7]
/usr/sbin/mysqld(pfs_spawn_thread+0x12a)[0xb3721a]
/lib64/libpthread.so.0[0x34df4079d1]
/lib64/libc.so.6(clone+0x6d)[0x34df0e8b6d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f15bc100c00): is an invalid pointer
Connection ID (thread ID): 54
Status: NOT_KILLED

You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
131217 10:30:11 mysqld_safe Number of processes running now: 0
131217 10:30:11 mysqld_safe WSREP: not restarting wsrep node automatically
131217 10:30:11 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

Tags: i59934
Seppo Jaakola (seppo-jaakola) wrote :

According to the variables, you have binlog_format=STATEMENT, which is not supported at the moment.
Better to set it to ROW.
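
For an already-running node, a minimal way to apply this (illustrative, not from the report; assumes a session with the SUPER privilege, and only new connections pick up the global value) would be:

    SET GLOBAL binlog_format = 'ROW';

To make the setting survive a restart, binlog_format = ROW also needs to be set under the [mysqld] section of my.cnf.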

Changed in percona-xtradb-cluster:
status: New → Incomplete
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Yes, it looks like binlog_format is set to STATEMENT.

Also, in the upcoming 5.6.15-25.2 release, I have added a runtime check against binlog_format changes.

One more issue here is that if binlog_format is not set in
my.cnf, the default, STATEMENT, is taken. Discussed here: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1243228
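
Since the default can silently leave a node on STATEMENT, it is worth verifying the effective value at runtime on every node; a minimal check (illustrative, not from the bug report) would be:

    -- should report ROW on every PXC node
    SHOW GLOBAL VARIABLES LIKE 'binlog_format';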

Launchpad Janitor (janitor) wrote :

[Expired for Percona XtraDB Cluster because there has been no activity for 60 days.]

Changed in percona-xtradb-cluster:
status: Incomplete → Expired
mgrennan (mark-grennan) wrote :

I'm running Percona Cluster 5.5.37035.0-55-log on a Red Hat 6.5 (updated) system.
My binlog_format is set to ROW. This is a single server, updated from Percona 5.5.30 (not cluster), intended to be turned into a cluster.

This error happens several times an hour.

tags: added: i59934
Changed in percona-xtradb-cluster:
status: Expired → New
Krunal Bauskar (krunal-bauskar) wrote :

commit 794f3cddb0c194767a760dc51b30b00ab94c55ac
Merge: 304970e be2dd53
Author: Krunal Bauskar <email address hidden>
Date: Mon Nov 16 19:49:55 2015 +0530

    Merge pull request #33 from kbauskar/3.x-pxc-456

    - PXC#456: WSREP: FSM: no such a transition ROLLED_BACK -> ROLLED_BAC…

commit be2dd5305479d621c94ba26992610efd84ca9752
Author: Krunal Bauskar <email address hidden>
Date: Mon Nov 16 10:58:42 2015 +0530

    - PXC#456: WSREP: FSM: no such a transition ROLLED_BACK -> ROLLED_BACK with
      LOAD DATA INFILE

      Issue:
      ------

      LDI (LOAD DATA INFILE), or for that matter any DML statement, can fail
      for multiple reasons. Some probable reasons are:
      - creating a table without a primary key while wsrep_certify_nonPK = OFF
      - an existing bug that causes LDI into a partitioned table to fail
      ... etc.

      A failed statement skips append_key, which, besides appending the key,
      also sets a valid trx_id. Such failed statements are rolled back with
      trx_id = default. The Galera plugin checks whether a trx object already
      exists for the given trx_id before creating a new one.

      If two independent connections (connected to the same cluster node) both
      execute a failing statement, both of them will try to roll back with
      trx_id = default.

      The logic that caches the trx_id -> trx-object mapping never considered
      this situation, so one of the connections gets a reference to an object
      that belongs to the other connection. This is logically wrong, since the
      two connections are unrelated, and it also causes operational
      inconsistency: the latter connection accesses state already modified by
      the former connection (causing the famous ROLLED_BACK -> ROLLED_BACK
      assert).

      Solution(s):
      ------------
      (All possible solutions are listed below, including the one we selected.)

      * The trx-map could use the pair <trx_id, conn_id> as its map key.

      * The trx-map could be a multimap from trx_id -> TrxObject, with the
        TrxObject carrying a valid conn_id (instead of -1 as it does now).
        For a valid trx_id there is only one trx_id -> TrxObject pair; for the
        default trx_id there can be multiple such pairs, and the proper one is
        selected based on conn_id.

      [Both of the above approaches need an interface change, so they are
       ruled out for now.]

      * Re-arrange the logic to discard the trx object while holding a lock on
        the trx, so that the latter connection still gets a reference to the
        object but cannot operate on it until the former one is done.
        (Logically, two connections sharing the object is itself wrong, but
         even if this could be made to work with some tweak in the code, it
         would introduce flow control since it involves exception handling.)

      * Introduce a separate map that caches pthread_id -> TrxObject when
        trx_id = default.
        (Given the limited changes involved, we opted for this solution, though
         we would prefer to sort this out with upstream using the interface-change
         solutions mentioned above.)
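
The last option above is the one that was implemented. A minimal C++ sketch of the idea follows; it is illustrative only, assuming hypothetical names (TrxHandle, TrxPool, TRX_ID_UNDEFINED) and using std::thread::id in place of the pthread_id mentioned in the commit message, and is not the actual Galera code:

    // Illustrative sketch only -- not the Galera implementation.
    // Idea: transactions that roll back with the default (unset) trx_id are
    // cached per calling thread, so two unrelated connections can no longer
    // resolve to the same trx object and replay each other's state transitions.
    #include <cstdint>
    #include <map>
    #include <memory>
    #include <mutex>
    #include <thread>

    struct TrxHandle {                        // stand-in for galera::TrxHandle
        uint64_t trx_id;
        explicit TrxHandle(uint64_t id) : trx_id(id) {}
    };

    class TrxPool {
    public:
        static constexpr uint64_t TRX_ID_UNDEFINED = ~uint64_t(0);

        // Return the cached handle for this trx_id (and, for the default id,
        // for the calling thread), creating it on first use.
        std::shared_ptr<TrxHandle> get_or_create(uint64_t trx_id) {
            std::lock_guard<std::mutex> lock(mutex_);
            if (trx_id != TRX_ID_UNDEFINED) {
                auto& slot = by_trx_id_[trx_id];   // one object per valid trx_id
                if (!slot) slot = std::make_shared<TrxHandle>(trx_id);
                return slot;
            }
            // Failed statements roll back with trx_id = default; key them by
            // the connection's thread so concurrent failures stay isolated.
            auto& slot = by_thread_[std::this_thread::get_id()];
            if (!slot) slot = std::make_shared<TrxHandle>(trx_id);
            return slot;
        }

        // Drop the cached handle once the rollback is complete.
        void discard(uint64_t trx_id) {
            std::lock_guard<std::mutex> lock(mutex_);
            if (trx_id != TRX_ID_UNDEFINED) by_trx_id_.erase(trx_id);
            else                            by_thread_.erase(std::this_thread::get_id());
        }

    private:
        std::mutex mutex_;
        std::map<uint64_t, std::shared_ptr<TrxHandle>>        by_trx_id_;
        std::map<std::thread::id, std::shared_ptr<TrxHandle>> by_thread_;
    };

Under this sketch, the two failing connections from the report each get a distinct object for the default trx_id, so neither one drives another connection's object through a second ROLLED_BACK transition.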

Changed in percona-xtradb-cluster:
importance: Undecided → Critical
status: New → Fix Committed
Changed in percona-xtradb-cluster:
assignee: nobody → Krunal Bauskar (krunal-bauskar)
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-926
