Percona XtraDB Cluster - HA scalable solution for MySQL

WSREP: FSM: no such a transition COMMITTING -> REPLICATING

Reported by Joep on 2012-12-10
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Low
Seppo Jaakola
Percona XtraDB Cluster
Status tracked in Trunk
Trunk
Undecided
Unassigned

Bug Description

The mysqld process crashes after only a few queries. From this error, it's impossible to see what went wrong. Could be a bug, could be an error on our end, but there's no way to tell.

Setup:

3 - node cluster (test setup)
All on virtual machines 2 CPU / 2 GB (node3, 4 & 5)
All InnoDB
Relevant my.cnf parts: (taken from node3)

----------------------------------------------------------------------------------------------------------------------------
# CACHES AND LIMITS #
tmp_table_size = 32M
max_heap_table_size = 32M
query_cache_type = 0
query_cache_size = 0
max_connections = 200
thread_cache_size = 20
open_files_limit = 65535
table_definition_cache = 4096
table_open_cache = 4096

# INNODB #
innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 128M
innodb_flush_log_at_trx_commit = 2
innodb_file_per_table = 1
innodb_buffer_pool_size = 1488M

server_id=1
wsrep_cluster_address="gcomm://node4,node5"
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_provider_options = "gmcast.listen_addr=tcp://node3; sst.recv_addr=node3; evs.keepalive_period = PT3S; evs.inactive_check_period = PT10S; evs.suspect_timeout = PT30S; evs.inactive_timeout = PT1M; evs.install_timeout = PT1M"
wsrep_slave_threads=2
wsrep_cluster_name=**removed**
wsrep_sst_method=xtrabackup
wsrep_sst_auth=root:**removed**
wsrep_node_name=node3
wsrep_node_incoming_address="node3:4567"
wsrep_node_address="node3"
innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2
----------------------------------------------------------------------------------------------------------------------------

Crash happens also when running only 1 node
Percona Xtradb cluster Version 5.5.28
*no* messages of any importance before first message in trace below
Information about "no such a transition COMMITTING -> REPLICATING" nowhere to be found (manual, forum, documentation, google) but that could also be a lack of searching skills on my end.

==================================================================

121210 9:52:52 [ERROR] WSREP: FSM: no such a transition COMMITTING -> REPLICATING
08:52:52 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=33554432
read_buffer_size=131072
max_used_connections=7
max_threads=200
thread_count=6
connection_count=6
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 470497 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x520a080
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f802ede8e20 thread_stack 0x40000
mysqld(my_print_stacktrace+0x2e)[0x7cd73e]
mysqld(handle_fatal_signal+0x481)[0x6b1681]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f802e027cb0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7f802dc8f425]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7f802dc92b8b]
/usr/lib/libgalera_smm.so(_ZN6galera3FSMINS_9TrxHandle5StateENS1_10TransitionENS_10EmptyGuardENS_11EmptyActionEE8shift_toES2_+0x180)[0x7f802c799070]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM9replicateEPNS_9TrxHandleE+0xc5)[0x7f802c78efb5]
/usr/lib/libgalera_smm.so(galera_pre_commit+0xb0)[0x7f802c7a1680]
mysqld(_Z22wsrep_run_wsrep_commitP3THDP10handlertonb+0x753)[0x667c63]
mysqld[0x6689fb]
mysqld(_Z15ha_commit_transP3THDb+0x13b)[0x6b42eb]
mysqld(_Z17trans_commit_stmtP3THD+0x25)[0x65d515]
mysqld(_Z21mysql_execute_commandP3THD+0x63c)[0x5ac45c]
mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x2d1)[0x5b29f1]
mysqld[0x5b2b62]
mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x16b9)[0x5b4e79]
mysqld(_Z10do_commandP3THD+0x171)[0x5b5521]
mysqld(_Z24do_handle_one_connectionP3THD+0x180)[0x64fcd0]
mysqld(handle_one_connection+0x51)[0x64fe91]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f802e01fe9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f802dd4ccbd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f7f90004c50): is an invalid pointer
Connection ID (thread ID): 44
Status: NOT_KILLED

==================================================================

affects: percona-server → percona-xtradb-cluster
Joep (4-tmp) wrote :

Extra information: it seems like this is only happening when running queries via PHP. When executing exactly the same set of queries using cat queries.sql | mysql -u root -p -D etc etc nothing goes wrong.

Following packages installed on Ubuntu 12.04 LTS:

percona-xtrabackup
percona-xtradb-cluster-client-5.5
percona-xtradb-cluster-common-5.5
percona-xtradb-cluster-galera-2.x
percona-xtradb-cluster-server-5.5

<guessing>
It looks like this is one of the tables which, when a record is inserted via PHP, causes the crash. We've seen several different queries on which the server crashes, but it looks like it's limited to 3 tables. This is one of them:

CREATE TABLE `CRMentityformcache` (
  `formcacheid` mediumint(9) NOT NULL AUTO_INCREMENT,
  `eid` mediumint(9) NOT NULL,
  `formid` mediumint(9) NOT NULL,
  `tabletype` varchar(15) NOT NULL DEFAULT 'entity',
  `user` mediumint(9) NOT NULL,
  `content` varchar(255) DEFAULT NULL,
  `timestamp_last_change` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`formcacheid`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8 COMMENT='Interleave Cached parsed forms'

Application log show the last query executed which (probably) was directly causing the crash. Again, when executing this on the mysql prompt, all is ok and via PHP this query doesn't ALWAYS crash the server - only about 40-60% of the time (2 to 3 pageloads).

 SQL: ERROR_LAST_QUERY_WENT_WRONG Lost connection to MySQL server during query query : DELETE FROM CRMentityformcache WHERE user='1'

Of course the tables are checked and ok and the software / queries / etc run fine on a stock MySQL server.
</guessing>

Joep (4-tmp) wrote :
Download full text (3.2 KiB)

Completely reinstalled the cluster with completely new config. This time on dedicated hardware (6-core, 32GB servers). Exactly the same happens; as soon as some queries are executed the server on which those queries were excuted crashes. We're beginning to feel a bit hopeless about this. This MUST be our fault but we can't figure out what's happening. And then again - we don't think the crashing is by design either. It would be nice if someone could point us in SOME direction... Bump.

121213 21:12:00 [ERROR] WSREP: FSM: no such a transition REPLICATING -> REPLICATING
20:12:00 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=33554432
read_buffer_size=131072
max_used_connections=6
max_threads=500
thread_count=6
connection_count=6
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1127092 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x2efacd20
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fba485c7e20 thread_stack 0x40000
mysqld(my_print_stacktrace+0x2e)[0x7cd73e]
mysqld(handle_fatal_signal+0x481)[0x6b1681]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7fba5b47ecb0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7fba5b0e8445]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7fba5b0ebbab]
/usr/lib/libgalera_smm.so(_ZN6galera3FSMINS_9TrxHandle5StateENS1_10TransitionENS_10EmptyGuardENS_11EmptyActionEE8shift_toES2_+0x180)[0x7fba59bf2070]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM9replicateEPNS_9TrxHandleE+0xc5)[0x7fba59be7fb5]
/usr/lib/libgalera_smm.so(galera_pre_commit+0xb0)[0x7fba59bfa680]
mysqld(_Z22wsrep_run_wsrep_commitP3THDP10handlertonb+0x753)[0x667c63]
mysqld[0x6689fb]
mysqld(_Z15ha_commit_transP3THDb+0x13b)[0x6b42eb]
mysqld(_Z17trans_commit_stmtP3THD+0x25)[0x65d515]
mysqld(_Z21mysql_execute_commandP3THD+0x63c)[0x5ac45c]
mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x2d1)[0x5b29f1]
mysqld[0x5b2b62]
mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x16b9)[0x5b4e79]
mysqld(_Z10do_commandP3THD+0x171)[0x5b5521]
mysqld(_Z24do_handle_one_connectionP3THD+0x180)[0x64fcd0]
mysqld(handle_one_connection+0x51)[0x64fe91]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7fba5b476e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fba5b1a44bd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7fb334004b60): is an invalid pointer
Connection ID (thread ID): 9
Status: NOT_KILLED

You may download the Percona Server operations manual by visiting
http://www.percona.com/sof...

Read more...

Joep (4-tmp) wrote :
Download full text (5.7 KiB)

OK Narrowed it down. Percona always crashes when running parrallel update queries on the same NON-EXISTING record. Steps to reproduce (on linux) below. Crashes Percona 100% of the tests. Even on 1 node, with or without the rest of the cluster.

#create database
mysql -u root -pMyPass -e "CREATE DATABASE crashtest"

#create table t1
mysql -u root -pMyPass -D crashtest -e "CREATE TABLE t1 (id int(11) NOT NULL AUTO_INCREMENT, f1 varchar(50) NOT NULL, PRIMARY KEY (id)) ENGINE=InnoDB DEFAULT CHARSET=utf8"

#don't insert any records, leave the table empty (although it crashes also on tables containing records, doesn't really matter)

#multiple consecutive updates: NO PROBLEM
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1"

#multiple simultaneous updates: CRASH
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' where id=1" &
mysql -u root -pMyPass -D crashtest -e "update t1 set f1='test2' ...

Read more...

description: updated
Changed in percona-xtradb-cluster:
status: New → Confirmed
Changed in codership-mysql:
status: New → Confirmed
Changed in codership-mysql:
assignee: nobody → Seppo Jaakola (seppo-jaakola)
milestone: none → 5.5.28-23.8
Changed in codership-mysql:
status: Confirmed → In Progress
Seppo Jaakola (seppo-jaakola) wrote :

This test is using statement replication (binlog_format=STATEMENT), and this is not fully supported in Galera replication. However, a crash is not a good response for configuration problem, so this bug must be fixed to give a proper warning etc... for end user.

Joep, you should be able to run your test, if you configure your application to use the row based replication, i.e. write in your my.cnf file:

binlog_format = ROW

The problem anatomy is as follows:
1. a non effective DML processes and generates a query log binlog event
2. the DML does not interact with Galera because no rows were affected, so Galera transaxctio ID remains uninitialized
3. at commit time, the query log event is packed in write set, and replicated with the uninitialized trx ID
This is not fatal per se, but if several such transactions replicate at same time, they will use same uninitialized transaction ID, and this can cause problem in Galera replication.

As a fix, replication should be skipped if no valid transaction ID has been assigned.

Changed in codership-mysql:
importance: Undecided → Low
Joep (4-tmp) wrote :

Confirmed. Crash is gone after setting binlog_format = ROW. Our configuration was made with the Percona Configuration Wizard, which didn't mention binlog_format. The rest is all default. Maybe it's an idea to put binlog_format=ROW in the wizard, too. At least as long as this issue isn't fixed.

Anyway, thanks a bunch for helping us out. Our cluster seems stable now, so we can continue testing!

Seppo Jaakola (seppo-jaakola) wrote :

Fix pushed in wsrep-5.5 trunk: http://bazaar.launchpad.net/~codership/codership-mysql/wsrep-5.5/revision/3838
This will simply skip replication, if transaction did not modify any rows. binlog format is not checked.

Changed in codership-mysql:
status: In Progress → Fix Committed
Changed in percona-xtradb-cluster:
milestone: none → 5.5.29-23.8
Changed in percona-xtradb-cluster:
milestone: 5.5.29-23.8 → 5.5.29-27.3.1
status: Confirmed → Fix Committed
milestone: 5.5.29-27.3.1 → 5.5.29-23.7.1
Changed in codership-mysql:
milestone: 5.5.29-24.8 → 5.5.29-23.7.1
Changed in percona-xtradb-cluster:
status: Fix Committed → Fix Released
Alex Yurchenko (ayurchen) wrote :

fix released in 23.7.3

Changed in codership-mysql:
status: Fix Committed → Fix Released
no longer affects: percona-xtradb-cluster/5.6
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers