Enabling wsrep_log_conflicts dynamically causes node to hang

Bug #1293624 reported by Jay Janssen on 2014-03-17
Affects: MySQL patches by Codership (Importance: Undecided, Assigned to: Unassigned)
Affects: Percona XtraDB Cluster (status tracked in 5.6)
  5.5: Importance: Undecided, Assigned to: Unassigned
  5.6: Importance: Undecided, Assigned to: Unassigned

Bug Description

Scenario:

3 node cluster, PXC 5.6

[root@node1 ~]# rpm -qa | grep -i percona
Percona-Server-shared-51-5.1.72-rel14.10.597.rhel6.x86_64
Percona-XtraDB-Cluster-galera-3-3.3-1.203.rhel6.x86_64
percona-toolkit-2.2.5-2.noarch
percona-xtrabackup-2.1.7-721.rhel6.x86_64
Percona-XtraDB-Cluster-56-5.6.15-25.4.731.rhel6.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.15-25.3.706.rhel6.x86_64
Percona-XtraDB-Cluster-client-56-5.6.15-25.4.731.rhel6.x86_64
Percona-XtraDB-Cluster-server-56-5.6.15-25.4.731.rhel6.x86_64

I am doing this experiment:

# Create a test table
node1 mysql> create table test.deadlocks( i int unsigned not null primary key, j varchar(32) );
node1 mysql> insert into test.deadlocks values ( 1, NULL );

node1 mysql> begin; update test.deadlocks set j="node1" where i=1;

# Before commit, go to node3 in a separate window:
node3 mysql> begin; update test.deadlocks set j="node3" where i=1;
node3 mysql> commit;

node1 mysql> commit;
node1 mysql> select * from test.deadlocks;

This works fine, but if I do this on node1 and re-do the experiment:

node1 mysql> set global wsrep_log_conflicts=ON;

the commit on node1 hangs indefinitely.

node1 mysql> set global wsrep_log_conflicts=ON;
Query OK, 0 rows affected (0.00 sec)

node1 mysql> begin;
Query OK, 0 rows affected (0.00 sec)

node1 mysql> update test.deadlocks set j="node1" where i=1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

node1 mysql> commit;
^CCtrl-C -- sending "KILL QUERY 19" to server ...
^C^C^C^C^C

I get this in the log:

2014-03-17 15:07:15 32710 [Note] WSREP: cluster conflict due to certification failure for threads:
2014-03-17 15:07:15 32710 [Note] WSREP: Victim thread:
   THD: 19, mode: local, state: executing, conflict: cert failure, seqno: 213333
   SQL: commit

I have to kill the node after this to get it back to a healthy state.


With UNIV_DEBUG:

2014-03-17 21:24:14 54487 [Note] WSREP: TO BEGIN: -1, 0 : create table test.deadlocks( i int unsigned not null primary key, j varchar(32) )
2014-03-17 21:24:14 54487 [Note] WSREP: TO BEGIN: 561466, 2
2014-03-17 21:24:14 54487 [Note] WSREP: TO END: 561466, 2 : create table test.deadlocks( i int unsigned not null primary key, j varchar(32) )
2014-03-17 21:24:14 54487 [Note] WSREP: TO END: 561466
########################################
DEADLOCK of threads detected!
Mutex 0x3fc2748 owned by thread 140133248501504 file /media/Oort/ncode/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/lock/lock0lock.cc line 2456
--Thread 140133248501504 has waited at lock0lock.cc line 1642 for 0.0000 seconds the semaphore:
Mutex at 0x3fc2748 '&trx_sys->mutex', lock var 1
Last time reserved in file /media/Oort/ncode/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/lock/lock0lock.cc line 2456, waiters flag 1
########################################
2014-03-17 21:26:05 7f73507f8700 InnoDB: Assertion failure in thread 140133248501504 in file sync0arr.cc line 426
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
15:56:05 UTC - mysqld got signal 6 ;


This is due to the wsrep_log_conflicts logic in the PXC tree (not in the codership tree).
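
Put differently: with wsrep_log_conflicts enabled, the PXC logging branch in wsrep_kill_victim acquires trx_sys->mutex again while the same thread already holds it from the caller in lock0lock.cc (the line 2456 owner in the trace above). InnoDB mutexes are not recursive, so the thread ends up waiting on itself, which is exactly what the UNIV_DEBUG sync array reports. A minimal sketch of the pattern, using std::mutex as a stand-in and hypothetical function names (this is not the actual PXC code):

    /* Sketch only: a thread re-locking a non-recursive mutex it already owns,
       the analogue of the double acquisition of trx_sys->mutex seen above. */
    #include <mutex>

    std::mutex trx_sys_mutex;               // stand-in for trx_sys->mutex

    void log_conflict_sketch()              // analogue of the wsrep_log_conflicts branch
    {
        trx_sys_mutex.lock();               // second acquisition by the same thread:
                                            // undefined behaviour for std::mutex and
                                            // in practice a hang, which UNIV_DEBUG
                                            // reports as the deadlock above
        // ... print priority/victim transactions ...
        trx_sys_mutex.unlock();
    }

    void kill_victim_caller_sketch()        // analogue of the lock0lock.cc caller
    {
        trx_sys_mutex.lock();               // first acquisition (the line 2456 owner)
        log_conflict_sketch();              // never returns
        trx_sys_mutex.unlock();             // never reached
    }

    int main() { kill_victim_caller_sketch(); }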

However, using the codership conflict logic here leads to:

2014-03-17 22:15:01 13605 [Note] WSREP: Provider paused at f7e31510-9958-11e3-82f8-abba51ecd1d8:561479 (5)
2014-03-17 22:15:03 13605 [Note] WSREP: resuming provider at 5
2014-03-17 22:15:03 13605 [Note] WSREP: Provider resumed.
2014-03-17 22:15:03 13605 [Note] WSREP: 0.0 (Arch1): State transfer to 1.0 (Arch2) complete.
2014-03-17 22:15:03 13605 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 561479)
2014-03-17 22:15:03 13605 [Note] WSREP: Member 0 (Arch1) synced with group.
2014-03-17 22:15:03 13605 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 561479)
2014-03-17 22:15:03 13605 [Note] WSREP: Synchronized with group, ready for connections
2014-03-17 22:15:03 13605 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
WSREP_SST: [INFO] Total time on donor: 0 seconds (20140317 22:15:03.272)
[Thread 0x7fff7effd700 (LWP 13917) exited]
2014-03-17 22:15:09 13605 [Note] WSREP: 1.0 (Arch2): State transfer from 0.0 (Arch1) complete.
2014-03-17 22:15:09 13605 [Note] WSREP: Member 1 (Arch2) synced with group.
2014-03-17 22:15:43 13605 [Note] WSREP: cluster conflict due to high priority abort for threads:
2014-03-17 22:15:43 13605 [Note] WSREP: Winning thread:
   THD: 3, mode: applier, state: executing, conflict: no conflict, seqno: 561481
   SQL: (null)
2014-03-17 22:15:43 13605 [Note] WSREP: Victim thread:
   THD: 5, mode: local, state: idle, conflict: no conflict, seqno: -1
   SQL: (null)
2014-03-17 22:15:43 13605 [Note] WSREP: BF kill (1, seqno: 561481), victim: (5) trx: 1792773
2014-03-17 22:15:43 13605 [Note] WSREP: Aborting query: void
2014-03-17 22:15:43 13605 [Note] WSREP: kill IDLE for 1792773
2014-03-17 22:15:43 13605 [Note] WSREP: enqueuing trx abort for (5)
2014-03-17 22:15:43 13605 [Note] WSREP: signaling aborter
2014-03-17 22:15:43 13605 [Note] WSREP: WSREP rollback thread wakes for signal
2014-03-17 22:15:43 13605 [Note] WSREP: client rollback due to BF abort for (5), query: (null)
2014-03-17 22:15:43 13605 [Note] WSREP: WSREP rollbacker aborted thd: (5 140736823138048)
2014-03-17 22:15:45 13605 [Note] WSREP: Deadlock error for: (null)
InnoDB: sync levels should be > 298 but a level is 297
Mutex '&trx->mutex'
InnoDB: Locked mutex: addr 0x7fffb8037a80 thread 140736630683392 file /media/Oort/ncode/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/lock/lock0lock.cc line 2455
InnoDB: sync_thread_levels_g(array, 298) does not hold!
2014-03-17 22:16:05 7fffcce0f700 InnoDB: Assertion failure in thread 140736630683392 in file sync0sync.cc line 1268
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.

But this happens only in UNIV_DEBUG builds.

#0 0x00007ffff5f5e389 in raise () from /usr/lib/libc.so.6
#1 0x00007ffff5f5f788 in abort () from /usr/...

Alex Yurchenko (ayurchen) wrote:

Confirmed for the codership-mysql 5.6 branch.


Fixed as follows:

diff:
=== modified file 'Percona-Server/storage/innobase/lock/lock0lock.cc'
--- Percona-Server/storage/innobase/lock/lock0lock.cc 2014-03-09 13:26:24 +0000
+++ Percona-Server/storage/innobase/lock/lock0lock.cc 2014-03-17 19:19:58 +0000
@@ -1639,7 +1639,6 @@
                        is in the queue*/
                } else if (lock->trx != trx) {
                        if (wsrep_log_conflicts) {
-                               mutex_enter(&trx_sys->mutex);
                                if (bf_this)
                                        fputs("\n*** Priority TRANSACTION:\n",
                                              stderr);
@@ -1656,7 +1655,6 @@
                                              stderr);
                                trx_print_latched(stderr, lock->trx, 3000);

-                               mutex_exit(&trx_sys->mutex);
                                fputs("*** WAITING FOR THIS LOCK TO BE GRANTED:\n",
                                      stderr);

------------------------------------------------------------
revno: 752
fixes bug: https://launchpad.net/bugs/1293624
committer: Raghavendra D Prabhu <email address hidden>
branch nick: pxc56
timestamp: Tue 2014-03-18 00:49:58 +0530
message:
  Bug#1293624: Enabling wsrep_log_conflicts dynamically causes node to hang

  Fix merge regression of Bug#1234382 from codership-5.6 tree.

  This merge caused double acquisition of trx_sys->mutex.

  There are issues with the codership fix for trx_sys->mutex acquisition in wsrep_kill_victim.

  =======================
  query: (null)
  2014-03-17 22:15:43 13605 [Note] WSREP: WSREP rollbacker aborted thd: (5 140736823138048)
  2014-03-17 22:15:45 13605 [Note] WSREP: Deadlock error for: (null)
  InnoDB: sync levels should be > 298 but a level is 297
  Mutex '&trx->mutex'
  InnoDB: Locked mutex: addr 0x7fffb8037a80 thread 140736630683392 file /media/Oort/ncode/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/lock/lock0lock.cc line 2455
  InnoDB: sync_thread_levels_g(array, 298) does not hold!
  2014-03-17 22:16:05 7fffcce0f700 InnoDB: Assertion failure in thread 140736630683392 in file sync0sync.cc line 1268
  InnoDB: We intentionally generate a memory trap.
  InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
  InnoDB: If you get repeated assertion failures or crashes, even
  InnoDB: immediately after the mysqld startup, there may be
  InnoDB: corruption in the InnoDB tablespace. Please refer to
  InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
  InnoDB: about forcing recovery.

  ====================================

  This is because it violates the latching order:

  #define SYNC_TRX_SYS 298
  #define SYNC_TRX 297

  from sync0sync.h

  i.e. trx_sys->mutex must always be acquired before trx->mutex.

  The fix introduced in the earlier revision 490 does this by acquiring trx_sys->mutex much higher up, in lock_rec_lock_slow.

  However, the wsrep_log_conflicts handling in wsrep_kill_victim is not completely thread-safe.

  It is called from several locations:

  wsrep_kill_victim(GLOBAL_SYMBOL_REF,0)
          -lock_rec_other_has_conflicting(Percona-Server/storage...

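To make the latching rule in the commit message concrete: the sync levels SYNC_TRX_SYS (298) and SYNC_TRX (297) encode the order trx_sys->mutex before trx->mutex, and a non-recursive mutex must not be taken twice by the same thread. The diff above follows the shape of taking the outer latch once, high up in the caller (lock_rec_lock_slow in revision 490), and letting the logging code simply rely on it. A minimal sketch of that shape, with std::mutex stand-ins and hypothetical names (not the actual revision 490/752 code; the real diff just drops the inner mutex_enter()/mutex_exit()):

    /* Sketch only: the caller owns the outer latch once; the callee relies on
       it and only ever adds the lower-level latch, never the reverse order. */
    #include <cassert>
    #include <mutex>

    std::mutex trx_sys_mutex;               // level 298 analogue: take first
    std::mutex trx_mutex;                   // level 297 analogue: take second
    bool trx_sys_mutex_owned = false;       // crude stand-in for InnoDB's mutex_own()

    void log_conflict_latched_sketch()
    {
        // The real fix simply removes the inner mutex_enter()/mutex_exit();
        // the assert here only documents the invariant the callee relies on.
        assert(trx_sys_mutex_owned);

        std::lock_guard<std::mutex> g(trx_mutex);   // 297 after 298: order respected
        // ... trx_print_latched()-style output would go here ...
    }

    void lock_rec_lock_slow_sketch()        // analogue of the high-level caller
    {
        std::lock_guard<std::mutex> g(trx_sys_mutex);   // taken exactly once
        trx_sys_mutex_owned = true;
        log_conflict_latched_sketch();
        trx_sys_mutex_owned = false;
    }

    int main() { lock_rec_lock_slow_sketch(); }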


Another fix was done for this:

------------------------------------------------------------
revno: 754
committer: Raghavendra D Prabhu <email address hidden>
branch nick: pxc56
timestamp: Wed 2014-03-19 02:59:01 +0530
message:
  Bug#1293624: Fix another issue with wsrep_log_conflicts.

  This time it is taking trx_sys->mutex in the following stack:

   # 2014-03-17T16:51:33 [2733] #6 0x0000000000ce86c8 in trx_print_latched (f=0x7f8fd8612860, trx=0x7f8d60032cc8, max_query_len=3000) at /mnt/workspace/build-xtradb-cluster-binaries-56/BUILD_TYPE/debug/label_exp/centos6-64/Percona-XtraDB-Cluster-5.6.15/storage/innobase/trx/trx0trx.cc:2072
   # 2014-03-17T16:51:33 [2733] #7 0x0000000000baea95 in wsrep_kill_victim (trx=0x7f8d60032cc8, lock=0x7f8d640cef00) at /mnt/workspace/build-xtradb-cluster-binaries-56/BUILD_TYPE/debug/label_exp/centos6-64/Percona-XtraDB-Cluster-5.6.15/storage/innobase/lock/lock0lock.cc:1648
   # 2014-03-17T16:51:33 [2733] #8 0x0000000000baec7c in lock_rec_other_has_conflicting (mode=2563, block=0x7f8e7d2e8e28, heap_no=1, trx=0x7f8d60032cc8) at /mnt/workspace/build-xtradb-cluster-binaries-56/BUILD_TYPE/debug/label_exp/centos6-64/Percona-XtraDB-Cluster-5.6.15/storage/innobase/lock/lock0lock.cc:1704
   # 2014-03-17T16:51:33 [2733] #9 0x0000000000bb9eb7 in lock_rec_insert_check_and_lock (flags=0, rec=0x7f8ead03c088 "\200", block=0x7f8e7d2e8e28, index=0x7f8d3c05a3e8, thr=0x7f8d3c06adb0, mtr=0x7f8fb876b130, inherit=0x7f8fb876b040) at /mnt/workspace/build-xtradb-cluster-binaries-56/BUILD_TYPE/debug/label_exp/centos6-64/Percona-XtraDB-Cluster-5.6.15/storage/innobase/lock/lock0lock.cc:6367
   # 2014-03-17T16:51:33 [2733] #10 0x0000000000d1d3bf in btr_cur_ins_lock_and_undo (flags=0, cursor=0x7f8fb876b600, entry=0x7f8d580727f8, thr=0x7f8d3c06adb0, mtr=0x7f8fb876b130, inherit=0x7f8fb876b040) at /mnt/workspace/build-xtradb-cluster-binaries-56/BUILD_TYPE/debug/label_exp/centos6-64/Percona-XtraDB-Cluster-5.6.15/storage/innobase/btr/btr0cur.cc:1272
   # 2014-03-17T16:51:33 [2733] #11 0x0000000000d1dba4 in btr_cur_optimistic_insert (flags=0, cursor=0x7f8fb876b600, offsets=0x7f8fb876b698, heap=0x7f8fb876b110, entry=0x7f8d580727f8, rec=0x7f8fb876b690, big_rec=0x7f8fb876b688, n_ext=0, thr=0x7f8d3c06adb0, mtr=0x7f8fb876b130) at /mnt/workspace/build-xtradb-cluster-binaries-56/BUILD_TYPE/debug/label_exp/centos6-64/Percona-XtraDB-Cluster-5.6.15/storage/innobase/btr/btr0cur.cc:1510
   # 2014-03-17T16:51:33 [2733] #12 0x0000000000c46774 in row_ins_sec_index_entry_low (flags=0, mode=2, index=0x7f8d3c05a3e8, offsets_heap=0x7f8d600a2e00, heap=0x7f8d600a9330, entry=0x7f8d580727f8, trx_id=0, thr=0x7f8d3c06adb0) at /mnt/workspace/build-xtradb-cluster-binaries-56/BUILD_TYPE/debug/label_exp/centos6-64/Percona-XtraDB-Cluster-5.6.15/storage/innobase/row/row0ins.cc:2812
   # 2014-03-17T16:51:33 [2733] #13 0x0000000000c46e47 in row_ins_sec_index_entry (index=0x7f8d3c05a3e8, entry=0x7f8d580727f8, thr=0x7f8d3c06adb0) at /mnt/workspace/build-xtradb-cluster-binaries-56/BUILD_TYPE/debug/label_exp/centos6-64/Percona-XtraDB-Cluster-5.6.15/storage/innobase/row/row0ins.cc:2997
   # 2014-03-17T16:51:33 [2733] #14 0x0000000000c46f38 i...
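
The revision 754 change itself is not pasted here, but the backtrace shows the recurring ingredient: under wsrep_log_conflicts, wsrep_kill_victim again ends up taking trx_sys->mutex, this time via trx_print_latched on an insert code path (lock_rec_insert_check_and_lock), whose latching state differs from the path seen earlier. One generic way to make such a diagnostic helper tolerant of callers with different latching states, purely as an illustration and not necessarily what revision 754 does, is to have the caller say whether it already holds the latch:

    /* Sketch only: a conflict-logging helper usable whether or not the caller
       already holds the outer latch. Names are hypothetical. */
    #include <mutex>

    std::mutex trx_sys_mutex;               // stand-in for trx_sys->mutex

    void print_conflict_sketch(bool caller_holds_trx_sys)
    {
        if (caller_holds_trx_sys) {
            // Caller already latched: print directly.
            // ... trx_print_latched()-style output ...
        } else {
            // Caller not latched: take and release the latch locally.
            std::lock_guard<std::mutex> g(trx_sys_mutex);
            // ... trx_print_latched()-style output ...
        }
    }

    int main()
    {
        print_conflict_sketch(false);       // e.g. a caller that has not latched trx_sys
        return 0;
    }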
