Percona XtraDB Cluster - HA scalable solution for MySQL

innodb: unprotected access to was_chosen_as_deadlock_victim

Reported by Teemu Ollakka on 2013-09-30
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Critical
Seppo Jaakola
Percona XtraDB Cluster
Status tracked in Trunk
5.6
Undecided
Unassigned
Trunk
Undecided
Unassigned

Bug Description

Innodb trx->lock.was_chosen_as_deadlock_victim was protected by kernel_mutex in mysql 5.5 but kernel_mutex has been removed in 5.6. Now trx->lock.was_chosen_as_deadlock_victim is protected by trx->mutex.

Changed in codership-mysql:
milestone: none → 5.6.14-24.1
assignee: nobody → Teemu Ollakka (teemu-ollakka)
importance: Undecided → Critical
Changed in codership-mysql:
assignee: Teemu Ollakka (teemu-ollakka) → Seppo Jaakola (seppo-jaakola)
Seppo Jaakola (seppo-jaakola) wrote :

a potential fix pushed in here: http://bazaar.launchpad.net/~codership/codership-mysql/5.6/revision/3953
tested with 2 node cluster, seriously conflicting sqlgen load through glb

Changed in codership-mysql:
status: New → In Progress
Download full text (6.1 KiB)

There is another related crash with following assertion:

2013-10-03 01:15:02 7ff4cd778700 InnoDB: Assertion failure in thread 140689395910400 in file lock0lock.cc line 7101
InnoDB: Failing assertion: mutex_own(&lock_sys->mutex)

======================================================================================================
(gdb) bt
#0 0x00007ff4f1bb40b1 in pthread_kill () from /usr/lib/libpthread.so.0
#1 0x0000000000941cbf in my_write_core (sig=6) at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/mysys/stacktrace.c:422
#2 0x00000000006d83ec in handle_fatal_signal (sig=6) at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/signal_handler.cc:254
#3 <signal handler called>
#4 0x00007ff4eff443d9 in raise () from /usr/lib/libc.so.6
#5 0x00007ff4eff457d8 in abort () from /usr/lib/libc.so.6
#6 0x00000000009a3a9a in lock_cancel_waiting_and_release (lock=lock@entry=0x7ff44c0c1e40) at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/lock/lock0lock.cc:7101
#7 0x000000000096a55d in wsrep_innobase_kill_one_trx (bf_trx=bf_trx@entry=0x7ff45c007cf8, victim_trx=victim_trx@entry=0x7ff44c0182d8, signal=signal@entry=1)
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/handler/ha_innodb.cc:17276
#8 0x000000000096ad60 in wsrep_abort_transaction (hton=<optimized out>, bf_thd=<optimized out>, victim_thd=0x2d40800, signal=1 '\001')
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/handler/ha_innodb.cc:17368
#9 0x0000000000648e1a in ha_wsrep_abort_transaction (bf_thd=bf_thd@entry=0x2d2ab40, victim_thd=victim_thd@entry=0x2d40800, signal=signal@entry=1 '\001')
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/handler.cc:7747
#10 0x000000000063b073 in wsrep_abort_thd (bf_thd_ptr=bf_thd_ptr@entry=0x2d2ab40, victim_thd_ptr=victim_thd_ptr@entry=0x2d40800, signal=signal@entry=1 '\001')
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/wsrep_thd.cc:457
#11 0x00000000006312d4 in wsrep_grant_mdl_exception (requestor_ctx=requestor_ctx@entry=0x2d2ac78, ticket=ticket@entry=0x7ff44c017ef0)
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/wsrep_mysqld.cc:1386
#12 0x00000000006cae19 in MDL_lock::can_grant_lock (this=this@entry=0x2c90e10, type_arg=MDL_EXCLUSIVE, requestor_ctx=requestor_ctx@entry=0x2d2ac78, ignore_lock_priority=ignore_lock_priority@entry=false)
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/mdl.cc:1804
#13 0x00000000006cbc68 in MDL_context::try_acquire_lock_impl (this=this@entry=0x2d2ac78, mdl_request=mdl_request@entry=0x7ff45c005160, out_ticket=out_ticket@entry=0x7ff4cd775fb8)
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/mdl.cc:2123
#14 0x00000000006cbddd in MDL_context::acquire_lock (this=this@entry=0x2d2ac78, mdl_request=0x7ff45c005160, lock_wait_timeout=lock_wait_timeout@entry=31536000)
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/mdl.cc:2275
#15 0x00000000006cc974 in MDL_context::acquire_locks (this=...

Read more...

trx_mutex also needs to be taken care of.

=== modified file 'Percona-Server/storage/innobase/handler/ha_innodb.cc'
--- Percona-Server/storage/innobase/handler/ha_innodb.cc 2013-10-02 09:14:44 +0000
+++ Percona-Server/storage/innobase/handler/ha_innodb.cc 2013-10-02 20:17:33 +0000
@@ -17273,7 +17273,11 @@
    if (wait_lock) {
     WSREP_DEBUG("canceling wait lock");
     victim_trx->lock.was_chosen_as_deadlock_victim= TRUE;
+ lock_mutex_enter();
+ trx_mutex_enter(victim_trx);
     lock_cancel_waiting_and_release(wait_lock);
+ trx_mutex_exit(victim_trx);
+ lock_mutex_exit();
    }

    wsrep_thd_awake(thd, signal);

fixes it. (Though there is another crash but not related)

Getting

""
2013-10-03 02:35:50 13788 [ERROR] /pxc56/bin/mysqld: Sort aborted: Deadlock found when trying to get lock; try restarting transaction
WSREP: conc slot cancel not supported
WSREP: conc slot cancel not supported
WSREP: conc slot cancel not supported
2013-10-03 02:35:52 13788 [ERROR] /pxc56/bin/mysqld: Sort aborted: Query execution was interrupted
2013-10-03 02:35:52 7fb430336700 InnoDB: Assertion failure in thread 140411879515904 in file lock0lock.cc line 861
InnoDB: Failing assertion: lock_get_wait(lock)
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.

""
from:

#5 0x00007fb4429e97d8 in abort () from /usr/lib/libc.so.6
#6 0x000000000099c5d0 in lock_reset_lock_and_trx_wait (lock=lock@entry=0x7fb3c408e2e8) at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/lock/lock0lock.cc:861
#7 0x00000000009a3bac in lock_cancel_waiting_and_release (lock=lock@entry=0x7fb3c408e2e8) at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/lock/lock0lock.cc:7123
#8 0x000000000096a595 in wsrep_innobase_kill_one_trx (bf_trx=bf_trx@entry=0x7fb3bc007cf8, victim_trx=victim_trx@entry=0x7fb3c40435b8, signal=signal@entry=1)
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/handler/ha_innodb.cc:17278
#9 0x000000000096ada9 in wsrep_abort_transaction (hton=<optimized out>, bf_thd=<optimized out>, victim_thd=0x32b2dd0, signal=1 '\001')
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/handler/ha_innodb.cc:17372
#10 0x0000000000648e1a in ha_wsrep_abort_transaction (bf_thd=bf_thd@entry=0x3246100, victim_thd=victim_thd@entry=0x32b2dd0, signal=signal@entry=1 '\001')
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/handler.cc:7747
#11 0x000000000063b073 in wsrep_abort_thd (bf_thd_ptr=bf_thd_ptr@entry=0x3246100, victim_thd_ptr=victim_thd_ptr@entry=0x32b2dd0, signal=signal@entry=1 '\001')
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/wsrep_thd.cc:457

This implies if lock_mutex is needed earlier.

lock_mutex and trx_mutex are required much earlier. (otherwise it fails as above)

================================================================================

=== modified file 'Percona-Server/storage/innobase/handler/ha_innodb.cc'
--- Percona-Server/storage/innobase/handler/ha_innodb.cc 2013-10-02 09:14:44 +0000
+++ Percona-Server/storage/innobase/handler/ha_innodb.cc 2013-10-02 21:30:59 +0000
@@ -17364,8 +17364,12 @@

        if (victim_trx)
        {
+ lock_mutex_enter();
+ trx_mutex_enter(victim_trx);
                int rcode = wsrep_innobase_kill_one_trx(bf_trx, victim_trx,
                                                        signal);
+ trx_mutex_exit(victim_trx);
+ lock_mutex_exit();
                wsrep_srv_conc_cancel_wait(victim_trx);
                DBUG_RETURN(rcode);
        } else {

================================================
This is because other callers (from lock0lock.cc) to wsrep_innobase_kill_one_trx hold lock_mutex and trx_mutex.

Seppo Jaakola (seppo-jaakola) wrote :
Changed in codership-mysql:
status: In Progress → Fix Committed
Changed in codership-mysql:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers