BF aborter hangs

Bug #428958 reported by Alex Yurchenko
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Fix Released
Critical
Seppo Jaakola
Trunk
Fix Released
Critical
Seppo Jaakola

Bug Description

Running sqlgen with default parameters against 2-node cluster makes one node hang in BF abort phase:

processlist form the failed node:

mysql> show processlist;
+-----+-------------+---------------------------+------+---------+-------+------------------------------------------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+-----+-------------+---------------------------+------+---------+-------+------------------------------------------+------------------+
| 1 | system user | | NULL | Sleep | 193 | invalidating query cache entries (table) | NULL |
| 2 | system user | | NULL | Sleep | 60240 | wsrep aborter idle | NULL |
| 726 | test | devel.codership.com:50902 | test | Query | 193 | NULL | COMMIT |
| 729 | test | devel.codership.com:42766 | test | Query | 193 | NULL | COMMIT |
| 730 | test | devel.codership.com:44156 | test | Query | 193 | NULL | COMMIT |
| 731 | root | localhost:52980 | NULL | Query | 0 | NULL | show processlist |
+-----+-------------+---------------------------+------+---------+-------+------------------------------------------+------------------+
6 rows in set (0.00 sec)

Error log from the failed node:

WSREP: BF victim is waitingWSREP c_lock is waiting: 21928WSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait longWSREP: BF lock wait long

Related branches

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Here is how to reproduce the issue:

1. Start two nodes (abyssinian and bengal) in this case
2. create sqlgen database with: (credentials omitted):
    sqlgen --host=abyssinian --create=1 --tables=4 --rows=4 --duration=0 --users=1
3. start sqlgen load with 4 concurrent users:
  sqlgen --host=abyssinian --host=bengal --create=0 --duration=60
4. monitor processlists in both nodes

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Hanging happens, when BF applier has a lock conflict with local trx, which is in lock wait state.
Log has following messages:
 BF victim is waiting
 WSREP c_lock is waiting

In this state, the lock request for the BF applier should advance in lock_rec_create() and there call:
lock_cancel_waiting_and_release(), for the conflicting local state lock.

For some reason, this lock cancel does not happen.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :
Download full text (6.5 KiB)

The issue was reproduced with innodb_lock_wait_timeout set to high value. Then it is possible to catch the
BF applier while still waiting for the local lock. Here is innodb status, from the hanging situation:

=====================================
090926 22:46:40 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 45 seconds
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 13, signal count 13
Mutex spin waits 0, rounds 165, OS waits 6
RW-shared spins 10, OS waits 5; RW-excl spins 2, OS waits 2
------------
TRANSACTIONS
------------
Trx id counter 0 88404
Purge done for trx's n:o < 0 88384 undo n:o < 0 0
History list length 9
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0 0, not started, process no 6115, OS thread id 140204650731856
MySQL thread id 21, query id 519 bengal 10.0.0.122 root
show innodb status
---TRANSACTION 0 88403, ACTIVE 243 sec, process no 6115, OS thread id 140204650998096 starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1216, 2 row lock(s), undo log entries 1
MySQL thread id 13, query id 505 10.0.0.101 root Updating
UPDATE comm00 SET x = (x + 1) % 65537, y = ((y*x % 65537) + (43086*y % 65537) + (y*z % 65537)) % 65537, z = ((z*x % 65537) + (z*y % 65537) + (43086*z % 65537)) % 65537, r = 0.041111 WHERE p = 0
Trx read view will not see trx with id >= 0 88404, sees < 0 88382
------- TRX HAS BEEN WAITING 243 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 50 n bits 72 index `PRIMARY` of table `test`.`comm00` trx id 0 88403 lock_mode X locks rec but not gap waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 7; compact format; info bits 0
 0: len 4; hex 80000000; asc ;; 1: len 6; hex 000000015946; asc YF;; 2: len 7; hex 000000003608d9; asc 6 ;; 3: len 4; hex 8000000b; asc ;; 4: len 4; hex 80009455; asc U;; 5: len 4; hex 80000507; asc ;; 6: len 8; hex 60b1868bdc53e33f; asc ` S ?;;

------------------
---TRANSACTION 0 88402, ACTIVE 243 sec, process no 6115, OS thread id 140204651264336 starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1216, 2 row lock(s), undo log entries 1
MySQL thread id 12, query id 497 10.0.0.101 root Updating
UPDATE comm01 SET x = (x + 1) % 65537, y = ((y*x % 65537) + (51377*y % 65537) + (y*z % 65537)) % 65537, z = ((z*x % 65537) + (z*y % 65537) + (51377*z % 65537)) % 65537, r = 0.884950 WHERE p = 3
Trx read view will not see trx with id >= 0 88403, sees < 0 88382
------- TRX HAS BEEN WAITING 243 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 51 n bits 72 index `PRIMARY` of table `test`.`comm01` trx id 0 88402 lock_mode X locks rec but not gap waiting
Record lock, heap no 5 PHYSICAL RECORD: n_fields 7; compact format; info bits 0
 0: len 4; hex 80000003; asc ;; 1: len 6; hex 00000001594b; asc YK;; 2: len 7; hex 000000013b2fbb; asc ;/ ;; 3: len 4; hex 8000000d; asc ;; 4: len 4; hex 80007bf8; asc { ;; 5: len 4; hex 8000b3a9; asc ;; 6: len 8; hex babe0f0709519e3f; asc Q ?;;

------------------
---TRANSACTION 0 88400, ACTIVE 243 sec, process no 6115, OS thread ...

Read more...

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

lock0lock.c:lock_rec_create() had a problem when calling lock_cancel_waiting_and_release().
The lock to cancel was c_lock, which is the the lock that conflicts with BF applier. However c_lock itself
is not waiting. The trx holding c_lock is waiting for some other lock and the fix is to cancel that lock wait
.i.e. c_lock->trx->wait_lock

We assert on the fact that wait_lock remains in effect from lock_rec_other_has_conflicting() until here.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.