innodb: unprotected access to was_chosen_as_deadlock_victim

Bug #1233383 reported by Teemu Ollakka on 2013-09-30
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Critical
Seppo Jaakola
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Status tracked in 5.6
5.5
Invalid
Undecided
Unassigned
5.6
Fix Released
Undecided
Unassigned

Bug Description

Innodb trx->lock.was_chosen_as_deadlock_victim was protected by kernel_mutex in mysql 5.5 but kernel_mutex has been removed in 5.6. Now trx->lock.was_chosen_as_deadlock_victim is protected by trx->mutex.

Changed in codership-mysql:
milestone: none → 5.6.14-24.1
assignee: nobody → Teemu Ollakka (teemu-ollakka)
importance: Undecided → Critical
Changed in codership-mysql:
assignee: Teemu Ollakka (teemu-ollakka) → Seppo Jaakola (seppo-jaakola)
Seppo Jaakola (seppo-jaakola) wrote :

a potential fix pushed in here: http://bazaar.launchpad.net/~codership/codership-mysql/5.6/revision/3953
tested with 2 node cluster, seriously conflicting sqlgen load through glb

Changed in codership-mysql:
status: New → In Progress
Download full text (6.1 KiB)

There is another related crash with following assertion:

2013-10-03 01:15:02 7ff4cd778700 InnoDB: Assertion failure in thread 140689395910400 in file lock0lock.cc line 7101
InnoDB: Failing assertion: mutex_own(&lock_sys->mutex)

======================================================================================================
(gdb) bt
#0 0x00007ff4f1bb40b1 in pthread_kill () from /usr/lib/libpthread.so.0
#1 0x0000000000941cbf in my_write_core (sig=6) at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/mysys/stacktrace.c:422
#2 0x00000000006d83ec in handle_fatal_signal (sig=6) at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/signal_handler.cc:254
#3 <signal handler called>
#4 0x00007ff4eff443d9 in raise () from /usr/lib/libc.so.6
#5 0x00007ff4eff457d8 in abort () from /usr/lib/libc.so.6
#6 0x00000000009a3a9a in lock_cancel_waiting_and_release (lock=lock@entry=0x7ff44c0c1e40) at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/lock/lock0lock.cc:7101
#7 0x000000000096a55d in wsrep_innobase_kill_one_trx (bf_trx=bf_trx@entry=0x7ff45c007cf8, victim_trx=victim_trx@entry=0x7ff44c0182d8, signal=signal@entry=1)
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/handler/ha_innodb.cc:17276
#8 0x000000000096ad60 in wsrep_abort_transaction (hton=<optimized out>, bf_thd=<optimized out>, victim_thd=0x2d40800, signal=1 '\001')
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/handler/ha_innodb.cc:17368
#9 0x0000000000648e1a in ha_wsrep_abort_transaction (bf_thd=bf_thd@entry=0x2d2ab40, victim_thd=victim_thd@entry=0x2d40800, signal=signal@entry=1 '\001')
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/handler.cc:7747
#10 0x000000000063b073 in wsrep_abort_thd (bf_thd_ptr=bf_thd_ptr@entry=0x2d2ab40, victim_thd_ptr=victim_thd_ptr@entry=0x2d40800, signal=signal@entry=1 '\001')
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/wsrep_thd.cc:457
#11 0x00000000006312d4 in wsrep_grant_mdl_exception (requestor_ctx=requestor_ctx@entry=0x2d2ac78, ticket=ticket@entry=0x7ff44c017ef0)
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/wsrep_mysqld.cc:1386
#12 0x00000000006cae19 in MDL_lock::can_grant_lock (this=this@entry=0x2c90e10, type_arg=MDL_EXCLUSIVE, requestor_ctx=requestor_ctx@entry=0x2d2ac78, ignore_lock_priority=ignore_lock_priority@entry=false)
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/mdl.cc:1804
#13 0x00000000006cbc68 in MDL_context::try_acquire_lock_impl (this=this@entry=0x2d2ac78, mdl_request=mdl_request@entry=0x7ff45c005160, out_ticket=out_ticket@entry=0x7ff4cd775fb8)
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/mdl.cc:2123
#14 0x00000000006cbddd in MDL_context::acquire_lock (this=this@entry=0x2d2ac78, mdl_request=0x7ff45c005160, lock_wait_timeout=lock_wait_timeout@entry=31536000)
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/mdl.cc:2275
#15 0x00000000006cc974 in MDL_context::acquire_locks (this=...

Read more...

trx_mutex also needs to be taken care of.

=== modified file 'Percona-Server/storage/innobase/handler/ha_innodb.cc'
--- Percona-Server/storage/innobase/handler/ha_innodb.cc 2013-10-02 09:14:44 +0000
+++ Percona-Server/storage/innobase/handler/ha_innodb.cc 2013-10-02 20:17:33 +0000
@@ -17273,7 +17273,11 @@
    if (wait_lock) {
     WSREP_DEBUG("canceling wait lock");
     victim_trx->lock.was_chosen_as_deadlock_victim= TRUE;
+ lock_mutex_enter();
+ trx_mutex_enter(victim_trx);
     lock_cancel_waiting_and_release(wait_lock);
+ trx_mutex_exit(victim_trx);
+ lock_mutex_exit();
    }

    wsrep_thd_awake(thd, signal);

fixes it. (Though there is another crash but not related)

Getting

""
2013-10-03 02:35:50 13788 [ERROR] /pxc56/bin/mysqld: Sort aborted: Deadlock found when trying to get lock; try restarting transaction
WSREP: conc slot cancel not supported
WSREP: conc slot cancel not supported
WSREP: conc slot cancel not supported
2013-10-03 02:35:52 13788 [ERROR] /pxc56/bin/mysqld: Sort aborted: Query execution was interrupted
2013-10-03 02:35:52 7fb430336700 InnoDB: Assertion failure in thread 140411879515904 in file lock0lock.cc line 861
InnoDB: Failing assertion: lock_get_wait(lock)
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.

""
from:

#5 0x00007fb4429e97d8 in abort () from /usr/lib/libc.so.6
#6 0x000000000099c5d0 in lock_reset_lock_and_trx_wait (lock=lock@entry=0x7fb3c408e2e8) at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/lock/lock0lock.cc:861
#7 0x00000000009a3bac in lock_cancel_waiting_and_release (lock=lock@entry=0x7fb3c408e2e8) at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/lock/lock0lock.cc:7123
#8 0x000000000096a595 in wsrep_innobase_kill_one_trx (bf_trx=bf_trx@entry=0x7fb3bc007cf8, victim_trx=victim_trx@entry=0x7fb3c40435b8, signal=signal@entry=1)
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/handler/ha_innodb.cc:17278
#9 0x000000000096ada9 in wsrep_abort_transaction (hton=<optimized out>, bf_thd=<optimized out>, victim_thd=0x32b2dd0, signal=1 '\001')
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/storage/innobase/handler/ha_innodb.cc:17372
#10 0x0000000000648e1a in ha_wsrep_abort_transaction (bf_thd=bf_thd@entry=0x3246100, victim_thd=victim_thd@entry=0x32b2dd0, signal=signal@entry=1 '\001')
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/handler.cc:7747
#11 0x000000000063b073 in wsrep_abort_thd (bf_thd_ptr=bf_thd_ptr@entry=0x3246100, victim_thd_ptr=victim_thd_ptr@entry=0x32b2dd0, signal=signal@entry=1 '\001')
    at /media/Tintin/Work/code/percona-xtradb-cluster/pxc56/Percona-Server/sql/wsrep_thd.cc:457

This implies if lock_mutex is needed earlier.

lock_mutex and trx_mutex are required much earlier. (otherwise it fails as above)

================================================================================

=== modified file 'Percona-Server/storage/innobase/handler/ha_innodb.cc'
--- Percona-Server/storage/innobase/handler/ha_innodb.cc 2013-10-02 09:14:44 +0000
+++ Percona-Server/storage/innobase/handler/ha_innodb.cc 2013-10-02 21:30:59 +0000
@@ -17364,8 +17364,12 @@

        if (victim_trx)
        {
+ lock_mutex_enter();
+ trx_mutex_enter(victim_trx);
                int rcode = wsrep_innobase_kill_one_trx(bf_trx, victim_trx,
                                                        signal);
+ trx_mutex_exit(victim_trx);
+ lock_mutex_exit();
                wsrep_srv_conc_cancel_wait(victim_trx);
                DBUG_RETURN(rcode);
        } else {

================================================
This is because other callers (from lock0lock.cc) to wsrep_innobase_kill_one_trx hold lock_mutex and trx_mutex.

Seppo Jaakola (seppo-jaakola) wrote :
Changed in codership-mysql:
status: In Progress → Fix Committed
Changed in codership-mysql:
status: Fix Committed → Fix Released

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1459

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers