WSREP: BF lock wait long

Bug #1671542 reported by Rodrigo
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Expired
Undecided
Unassigned

Bug Description

I have a cluster with 4 nodes (a,b,c,d). Only two nodes are running queries (a,b). a and b are running queries only from traditional sql slave queries.

"At some point" server a (or b) stop answering to innodb queries.
This is what I get from the logs over and over:

=====================================
2017-03-09 14:53:15 7f78c616e700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 16 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 53270 srv_active, 0 srv_shutdown, 3943 srv_idle
srv_master_thread log flush and writes: 57213
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 149226
OS WAIT ARRAY INFO: signal count 203024
Mutex spin waits 560247, rounds 5443761, OS waits 86994
RW-shared spins 133454, rounds 2415307, OS waits 42597
RW-excl spins 53362, rounds 928570, OS waits 9315
Spin rounds per wait: 9.72 mutex, 18.10 RW-shared, 17.40 RW-excl
------------------------
LATEST DETECTED DEADLOCK
------------------------
2017-03-09 14:28:10 7f7b9044b700
*** (1) TRANSACTION:
TRANSACTION 58558213874, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 1, OS thread handle 0x7f7be67a0700, query id 24338159 System lock
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2534440695 page no 1351 n bits 128 index `PRIMARY` of table `ft_prod`.`transfers` trx id 58558213874 lock_mode X locks rec but not gap waiting
*** (2) TRANSACTION:
TRANSACTION 58558213870, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
7 lock struct(s), heap size 1184, 9 row lock(s), undo log entries 8
MySQL thread id 7252, OS thread handle 0x7f7b9044b700, query id 24338156 System lock
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 2534440695 page no 1351 n bits 128 index `PRIMARY` of table `ft_prod`.`transfers` trx id 58558213870 lock_mode X locks rec but not gap
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2534440697 page no 90009 n bits 112 index `PRIMARY` of table `ft_prod`.`files` trx id 58558213870 lock_mode X locks rec but not gap waiting
*** WE ROLL BACK TRANSACTION (2)
------------
TRANSACTIONS
------------
Trx id counter 58558221968
Purge done for trx's n:o < 58558221389 undo n:o < 0 state: running but idle
History list length 1270
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 58558221391, ACTIVE 1422 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 360, 1 row lock(s)
MySQL thread id 1, OS thread handle 0x7f7be67a0700, query id 24349501 System lock
------- TRX HAS BEEN WAITING 1422 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2534440697 page no 90019 n bits 104 index `PRIMARY` of table `ft_prod`.`files` trx id 58558221391 lock_mode X locks rec but not gap waiting
------------------
TABLE LOCK table `ft_prod`.`files` trx id 58558221391 lock mode IX
RECORD LOCKS space id 2534440697 page no 90019 n bits 104 index `PRIMARY` of table `ft_prod`.`files` trx id 58558221391 lock_mode X locks rec but not gap waiting
---TRANSACTION 58558221390, ACTIVE 1422 sec
6 lock struct(s), heap size 1184, 14 row lock(s), undo log entries 14
MySQL thread id 7252, OS thread handle 0x7f7b9044b700, query id 24349499 wsrep in pre-commit stage
TABLE LOCK table `ft_prod`.`jobs` trx id 58558221390 lock mode IX
RECORD LOCKS space id 2534440696 page no 706 n bits 80 index `PRIMARY` of table `ft_prod`.`jobs` trx id 58558221390 lock_mode X locks rec but not gap
TABLE LOCK table `ft_prod`.`transfers` trx id 58558221390 lock mode IX
RECORD LOCKS space id 2534440695 page no 1351 n bits 128 index `PRIMARY` of table `ft_prod`.`transfers` trx id 58558221390 lock_mode X locks rec but not gap
TABLE LOCK table `ft_prod`.`files` trx id 58558221390 lock mode IX
RECORD LOCKS space id 2534440697 page no 90019 n bits 104 index `PRIMARY` of table `ft_prod`.`files` trx id 58558221390 lock_mode X locks rec but not gap
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (read thread)
I/O thread 7 state: waiting for completed aio requests (read thread)
I/O thread 8 state: waiting for completed aio requests (read thread)
I/O thread 9 state: waiting for completed aio requests (read thread)
I/O thread 10 state: waiting for completed aio requests (write thread)
I/O thread 11 state: waiting for completed aio requests (write thread)
I/O thread 12 state: waiting for completed aio requests (write thread)
I/O thread 13 state: waiting for completed aio requests (write thread)
I/O thread 14 state: waiting for completed aio requests (write thread)
I/O thread 15 state: waiting for completed aio requests (write thread)
I/O thread 16 state: waiting for completed aio requests (write thread)
I/O thread 17 state: waiting for completed aio requests (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0, 0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0, 0, 0, 0, 0] ,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
1780393 OS file reads, 14018199 OS file writes, 402044 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 8612, seg size 8614, 364053 merges
merged operations:
 insert 496458, delete mark 286507, delete 14981
discarded operations:
 insert 0, delete mark 0, delete 0
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 61956383148235
Log flushed up to 61956383148235
Pages flushed up to 61956383148235
Last checkpoint at 61956383148235
Max checkpoint age 3396675134
Checkpoint age target 3290529037
Modified age 0
Checkpoint age 0
0 pending log writes, 0 pending chkp writes
11102799 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 4395630592; in additional pool allocated 0
Total memory allocated by read views 216
Internal hash tables (constant factor + variable factor)
    Adaptive hash index 123587280 (75692648 + 47894632)
    Page hash 1107208 (buffer pool 0 only)
    Dictionary cache 23322966 (17702416 + 5620550)
    File system 1146064 (812272 + 333792)
    Lock system 10627248 (10625704 + 1544)
    Recovery system 0 (0 + 0)
Dictionary memory allocated 5620550
Buffer pool size 262140
Buffer pool size, bytes 4294901760
Free buffers 4096
Database pages 255119
Old database pages 94093
Modified db pages 0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 446331, not young 29425969
0.00 youngs/s, 0.00 non-youngs/s
Pages read 1779657, created 331614, written 2824562
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 255119, unzip_LRU len: 2
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
----------------------
INDIVIDUAL BUFFER POOL INFO
----------------------
---BUFFER POOL 0
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 1024
Database pages 63767
Old database pages 23518
Modified db pages 0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 123879, not young 7931142
0.00 youngs/s, 0.00 non-youngs/s
Pages read 495160, created 290632, written 1002138
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 63767, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 1
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 1024
Database pages 63789
Old database pages 23527
Modified db pages 0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 107752, not young 7132509
0.00 youngs/s, 0.00 non-youngs/s
Pages read 427794, created 13190, written 590790
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 63789, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 2
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 1024
Database pages 63773
Old database pages 23521
Modified db pages 0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 105920, not young 7195586
0.00 youngs/s, 0.00 non-youngs/s
Pages read 422453, created 14412, written 580473
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 63773, unzip_LRU len: 1
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 3
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 1024
Database pages 63790
Old database pages 23527
Modified db pages 0
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 108780, not young 7166732
0.00 youngs/s, 0.00 non-youngs/s
Pages read 434250, created 13380, written 651161
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 63790, unzip_LRU len: 1
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
0 read views open inside InnoDB
2 RW transactions active inside InnoDB
0 RO transactions active inside InnoDB
2 out of 1000 descriptors used
Main thread process no. 55881, id 140156727662336, state: sleeping
Number of rows inserted 769287, updated 10524708, deleted 331867, read 10856575
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long

Revision history for this message
Rodrigo (rodri-bernardo) wrote :

The version I am using is:
5.6.24-72.2-56-log Percona XtraDB Cluster (GPL), Release rel72.2, Revision 1, WSREP version 25.11, wsrep_25.11

Revision history for this message
Rodrigo (rodri-bernardo) wrote :

I have a cluster with 4 nodes (a,b,c,d). Only two nodes are running queries (a,b). a and b are running queries only from traditional sql slave queries.

"At some point" server a (or b) stop answering to innodb queries.
This is what I get from the logs over and over (look at the bottom)

Because the server is still running but not accepting queries. The entire cluster is "frozen".

The only solution (I could find) is killing the the mysqld process and restart the mysql service.

Revision history for this message
Rodrigo (rodri-bernardo) wrote :

No feedback at all from this? :(

Revision history for this message
Krunal Bauskar (krunal-bauskar) wrote :

You are hitting some lock wait issues.

Version you are using quite old and since then we have fixed tons of critical issues in PXC.
Can you upgrade to latest PXC version. There are good chances you will not hit it again.
Still if you continue to hit do send us err-logs/*cnf from all the nodes (if possible).
If you have a reproducible use-case that would help a lot.

Revision history for this message
Rodrigo (rodri-bernardo) wrote :

Thanks for your answer, I will try upgrading as soon as possible.
Is it also possible to downgrade easily?

Revision history for this message
Rodrigo (rodri-bernardo) wrote :

Unfortunately I haven't been able to upgrade yet. But since the last time I have enabled transaction debugging enabled:

# Enable Conflict Logging
wsrep_log_conflicts=ON
wsrep_provider_options="cert.log_conflicts=YES"

This has provided me the following errors messages that helped us find the transaction responsible for the bug triggering:

*** Priority TRANSACTION:
TRANSACTION 59326102868, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
1 lock struct(s), heap size 360, 0 row lock(s)
MySQL thread id 2, OS thread handle 0x7f0312884700, query id 588985631 System lock

*** Victim TRANSACTION:
TRANSACTION 59326102867, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 248540, OS thread handle 0x7f02bc368700, query id 588985629 System lock
*** WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2534440695 page no 2913 n bits 112 index `PRIMARY` of table `ft_prod`.`transfers` trx id 59326102867 lock_mode X locks rec but not gap
2017-04-12 10:39:56 24840 [Note] WSREP: cluster conflict due to high priority abort for threads:
2017-04-12 10:39:56 24840 [Note] WSREP: Winning thread:
   THD: 2, mode: applier, state: executing, conflict: no conflict, seqno: 3024048083
   SQL: (null)
2017-04-12 10:39:56 24840 [Note] WSREP: Victim thread:
   THD: 248540, mode: local, state: executing, conflict: cert failure, seqno: -1
   SQL: (null)
2017-04-12 10:39:56 24840 [Note] WSREP: BF kill (1, seqno: 3024048083), victim: (248540) trx: 59326102867
2017-04-12 10:39:56 24840 [Note] WSREP: Aborting query: void
2017-04-12 10:39:56 24840 [Note] WSREP: victim 59326102867 in state 7

Revision history for this message
Rodrigo (rodri-bernardo) wrote :

In the "other" server:
017-04-12 10:39:56 48628 [Note] WSREP: trx conflict for key (1,FLAT8)258f8b17 b6cf10ba: source: 37d7f177-0edb-11e7-b60d-b74af42e4eb1 version: 3 local: 0 state: CERTIFYING flags: 1 conn_id: 248540 trx_id: 59326102867 seqnos (l: 259495664, g: 3024048085, s: 3024048082, d: 3024048073, ts: 7832406975966801) <--X--> source: 7c323254-0eda-11e7-a7d4-869ec07b9367 version: 3 local: 1 state: COMMITTED flags: 1 conn_id: 255129 trx_id: 59329461256 seqnos (l: 259495663, g: 3024048084, s: 3024048083, d: 3024048083, ts: 7832808259707491)

Changed in percona-xtradb-cluster:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Percona XtraDB Cluster because there has been no activity for 60 days.]

Changed in percona-xtradb-cluster:
status: Incomplete → Expired
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1959

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.