regression test for lp:1019473 causes server abort

Bug #1264809 reported by Seppo Jaakola on 2013-12-29
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Status tracked in 5.6
5.5
High
Seppo Jaakola
5.6
High
Seppo Jaakola
Percona XtraDB Cluster
Status tracked in 5.6
5.5
Undecided
Unassigned
5.6
Undecided
Unassigned

Bug Description

Regression test for lp:1019473 will cause abort due to BF-BF lock conflict in lock0lock.c::lock_rec_has_to_wait():

   if (wsrep_trx_order_before(trx->mysql_thd,
         lock2->trx->mysql_thd) &&
       (type_mode & LOCK_MODE_MASK) == LOCK_X &&
       (lock2->type_mode & LOCK_MODE_MASK) == LOCK_X)
   {
    /* exclusive lock conflicts are not accepted */
    fprintf(stderr, "BF-BF X lock conflict\n");
    lock_rec_print(stderr, lock2);
    abort();
   } else {
...

131226 3:27:03 [Note] WSREP: replay trx: DELETE FROM lp1019473 WHERE fid=4 AND
uid=4 86889

 BF-BF lock conflict
RECORD LOCKS space id 0 page no 663 n bits 1056 index `uid` of table `test`.`lp1
019473` trx id 20C06 lock_mode X locks rec but not gap
Record lock, heap no 98 PHYSICAL RECORD: n_fields 2; compact format; info bits 3
2
 0: len 4; hex 00000004; asc ;;
 1: len 6; hex 0000000084ab; asc ;;

131226 3:27:03 [Note] WSREP: BF conflict, order: 86889 86892

BF-BF X lock conflict
RECORD LOCKS space id 0 page no 663 n bits 1056 index `uid` of table `test`.`lp1019473` trx id 20C06 lock_mode X locks rec but not gap
Record lock, heap no 98 PHYSICAL RECORD: n_fields 2; compact format; info bits 32
 0: len 4; hex 00000004; asc ;;
 1: len 6; hex 0000000084ab; asc ;;

03:27:03 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=10
max_threads=1024
thread_count=9
connection_count=9
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2248688 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x33acbb0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fa96d316e48 thread_stack 0x40000
/tmp/galera/local1/mysql/sbin/mysqld(my_print_stacktrace+0x2e)[0x808f82]
/tmp/galera/local1/mysql/sbin/mysqld(handle_fatal_signal+0x2ec)[0x6cc260]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7fa994c10cb0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7fa993c42425]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7fa993c45b8b]
/tmp/galera/local1/mysql/sbin/mysqld[0x9c7afd]
/tmp/galera/local1/mysql/sbin/mysqld[0x9cd11e]
/tmp/galera/local1/mysql/sbin/mysqld[0x9cdc56]
/tmp/galera/local1/mysql/sbin/mysqld[0x9ce45b]
/tmp/galera/local1/mysql/sbin/mysqld[0x8bd8e4]
/tmp/galera/local1/mysql/sbin/mysqld[0x8c6949]
/tmp/galera/local1/mysql/sbin/mysqld[0x87f6be]
/tmp/galera/local1/mysql/sbin/mysqld(_ZN7handler14index_read_mapEPhPKhm16ha_rkey_function+0x4b)[0x6d5bfd]
/tmp/galera/local1/mysql/sbin/mysqld(_ZN14Rows_log_event8find_rowEPK14Relay_log_info+0x448)[0x77add2]
/tmp/galera/local1/mysql/sbin/mysqld(_ZN21Delete_rows_log_event11do_exec_rowEPK14Relay_log_info+0xcf)[0x77bdf7]
/tmp/galera/local1/mysql/sbin/mysqld(_ZN14Rows_log_event14do_apply_eventEPK14Relay_log_info+0x843)[0x77097b]
/tmp/galera/local1/mysql/sbin/mysqld(_Z14wsrep_apply_cbPvPKvmjPK14wsrep_trx_meta+0x589)[0x687bc1]
/tmp/galera/local1/galera/lib/libgalera_smm.so(+0x2d1c36)[0x7fa9926f8c36]
/tmp/galera/local1/galera/lib/libgalera_smm.so(+0x2d1f28)[0x7fa9926f8f28]
/tmp/galera/local1/galera/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM10replay_trxEPNS_9TrxHandleEPv+0x21e)[0x7fa9926fc3dc]
/tmp/galera/local1/galera/lib/libgalera_smm.so(galera_replay_trx+0xc4)[0x7fa99270d00a]
/tmp/galera/local1/mysql/sbin/mysqld(_Z24wsrep_replay_transactionP3THD+0x281)[0x6897d1]
/tmp/galera/local1/mysql/sbin/mysqld[0x5a96c9]
/tmp/galera/local1/mysql/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x1145)[0x5aac03]
/tmp/galera/local1/mysql/sbin/mysqld(_Z10do_commandP3THD+0x718)[0x5acea6]
/tmp/galera/local1/mysql/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x11b)[0x655931]
/tmp/galera/local1/mysql/sbin/mysqld(handle_one_connection+0x52)[0x655aed]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7fa994c08e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fa993d003fd]

Related branches

Changed in codership-mysql:
status: New → In Progress
assignee: nobody → Seppo Jaakola (seppo-jaakola)
importance: Undecided → High
milestone: none → 5.5.35-25.10
Seppo Jaakola (seppo-jaakola) wrote :

The problem may relate to fix in: lp:1100496

The anatomy of the issue is as follows:

If a table does not have a primary key, nor any other unique, but has some non-unique keys defined, then DML on the table will result in taking X locks on the non-unique key. Because, the key is non-unique, there will be excessive record X locks on the table, and during parallel replication, some of the excessive X locks, granted for separate transactions, may conflict.

Galera does parallel applying control based on MD5 hash sum on the whole tuple, and it can be that two slave appliers are updating separate rows, but still end up in lock conflict due to these non-unique keys. (if mysql would use full table scan, we would not hit this issue)

Seppo Jaakola (seppo-jaakola) wrote :

A fix trial was pushed into wsrep-5.5, in revision: http://bazaar.launchpad.net/~codership/codership-mysql/wsrep-5.5/revision/3935

Here, we populate all non-unique keys for writes on table which does not have PK (nor other unique key), but has some non-unique keys. This may end up in big number of excessive keys and BF aborts on writes to such tables. But experiments will tell more.

tags: added: i40190
Seppo Jaakola (seppo-jaakola) wrote :

Fix for the wsrep-5.6 made it to the release: https://launchpad.net/codership-mysql/5.6/5.6.16-25.5
It was not possible to set target milestone anymore to that release

Ovais Tariq (ovais-tariq) wrote :

Raghu,

When should we expect the fix for this bug to be merged into PXC and available as a release? I have a customer affected by this bug.

Ovais Tariq (ovais-tariq) wrote :
Download full text (3.2 KiB)

I have a crash similar to the one reported in the bug even after upgrade to 5.6.15-25.5:

BF-BF X lock conflict
RECORD LOCKS space id 6910 page no 97 n bits 160 index `GEN_CLUST_INDEX` of table `db`.`tbl_without_pk` trx id 3254927279 lock_mode X locks rec but not gap
08:12:36 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=246
max_threads=1002
thread_count=46
connection_count=46
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 407967 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fb760000990
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fb7a6ef8d88 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x902645]
/usr/sbin/mysqld(handle_fatal_signal+0x4c4)[0x6801e4]
/lib64/libpthread.so.0(+0xfae0)[0x7fcec84c2ae0]
/lib64/libc.so.6(gsignal+0x35)[0x7fcec6934ba5]
/lib64/libc.so.6(abort+0x17b)[0x7fcec69364bb]
/usr/sbin/mysqld[0x9950db]
/usr/sbin/mysqld[0x99973f]
/usr/sbin/mysqld[0x999ae3]
/usr/sbin/mysqld[0xa0b4ff]
/usr/sbin/mysqld[0xa11679]
/usr/sbin/mysqld[0x968857]
/usr/sbin/mysqld(_ZN7handler11ha_rnd_nextEPh+0x9c)[0x5c087c]
/usr/sbin/mysqld(_ZN14Rows_log_event24do_table_scan_and_updateEPK14Relay_log_info+0x198)[0x89fec8]
/usr/sbin/mysqld(_ZN14Rows_log_event14do_apply_eventEPK14Relay_log_info+0xd8f)[0x8a59df]
/usr/sbin/mysqld(_ZN9Log_event11apply_eventEP14Relay_log_info+0x68)[0x8a6c38]
/usr/sbin/mysqld(_Z14wsrep_apply_cbPvPKvmjPK14wsrep_trx_meta+0x5ce)[0x5bd38e]
/usr/lib64/libgalera_smm.so(_ZNK6galera9TrxHandle5applyEPvPF15wsrep_cb_statusS1_PKvmjPK14wsrep_trx_metaERS6_+0xb1)[0x7fce5c60e211]
/usr/lib64/libgalera_smm.so(+0x1ab345)[0x7fce5c647345]
/usr/lib64/libgalera_smm.so(_ZN6galera13ReplicatorSMM9apply_trxEPvPNS_9TrxHandleE+0x283)[0x7fce5c6481b3]
/usr/lib64/libgalera_smm.so(_ZN6galera13ReplicatorSMM11process_trxEPvPNS_9TrxHandleE+0x45)[0x7fce5c648aa5]
/usr/lib64/libgalera_smm.so(_ZN6galera15GcsActionSource8dispatchEPvRK10gcs_actionRb+0x2c9)[0x7fce5c623a89]
/usr/lib64/libgalera_smm.so(_ZN6galera15GcsActionSource7processEPvRb+0x63)[0x7fce5c623f63]
/usr/lib64/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0x93)[0x7fce5c642db3]
/usr/lib64/libgalera_smm.so(galera_recv+0x23)[0x7fce5c657aa3]
/usr/sbin/mysqld[0x5be11f]
/usr/sbin/mysqld(start_wsrep_THD+0x3fe)[0x5ae42e]
/lib64/libpthread.so.0(+0x7ddb)[0x7fcec84baddb]
/lib64/libc.so.6(clone+0x6d)[0x7fcec69e2a1d]

Trying to get some variables.
Some pointers may be invalid an...

Read more...

Ovais Tariq (ovais-tariq) wrote :

Seppo you mentioned here https://bugs.launchpad.net/codership-mysql/+bug/1264809/comments/1 that "if mysql would use full table scan, we would not hit this issue". In this case based on the stack trace it looks as though MySQL indeed was doing a full table scan.

Ovais Tariq (ovais-tariq) wrote :

Attaching the standard backtrace taken using `thread apply all bt`

Ovais Tariq (ovais-tariq) wrote :

Attaching the full backtrace taken using `thread apply all bt full`

Ovais Tariq (ovais-tariq) wrote :
Download full text (10.4 KiB)

Looks like thread 38 and 33 have contending locks and both are doing a full table scan:

--
Thread 38 (Thread 0x7ff571abe700 (LWP 6433)):
+bt
#0 0x00007ffff7bcda5c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00000000009baee3 in os_cond_wait (event=0x525fa580, reset_sig_count=1) at /usr/src/debug/Percona-XtraDB-Cluster-5.6.15/storage/innobase/os/os0sync.cc:214
#2 os_event_wait_low (event=0x525fa580, reset_sig_count=1) at /usr/src/debug/Percona-XtraDB-Cluster-5.6.15/storage/innobase/os/os0sync.cc:610
#3 0x0000000000a2b00a in sync_array_wait_event (arr=0x1504d80, index=0) at /usr/src/debug/Percona-XtraDB-Cluster-5.6.15/storage/innobase/sync/sync0arr.cc:433
#4 0x0000000000a2d461 in mutex_spin_wait (_mutex=0x7fe8e95f3088, high_priority=false, file_name=0xc2b610 "/mnt/workspace/percona-xtradb-cluster-5.6-rpms/label_exp/centos6-64/target/BUILD/Percona-XtraDB-Cluster-5.6.15/storage/innobase/lock/lock0lock.cc", line=6478) at /us
r/src/debug/Percona-XtraDB-Cluster-5.6.15/storage/innobase/sync/sync0sync.cc:671
#5 0x0000000000992abc in mutex_enter_func (mutex=0x7fe8e95f3088, file_name=0xc2b610 "/mnt/workspace/percona-xtradb-cluster-5.6-rpms/label_exp/centos6-64/target/BUILD/Percona-XtraDB-Cluster-5.6.15/storage/innobase/lock/lock0lock.cc", line=6478) at /usr/src/debug/Percona-X
traDB-Cluster-5.6.15/storage/innobase/include/sync0sync.ic:273
#6 pfs_mutex_enter_func (mutex=0x7fe8e95f3088, file_name=0xc2b610 "/mnt/workspace/percona-xtradb-cluster-5.6-rpms/label_exp/centos6-64/target/BUILD/Percona-XtraDB-Cluster-5.6.15/storage/innobase/lock/lock0lock.cc", line=6478) at /usr/src/debug/Percona-XtraDB-Cluster-5.6.
15/storage/innobase/include/sync0sync.ic:350
#7 0x0000000000996890 in lock_rec_convert_impl_to_expl (block=0x7fec37f04c00, rec=0x7fedb7178082 "", index=0x7fe8c801f238, offsets=0x7ff571ab9810) at /usr/src/debug/Percona-XtraDB-Cluster-5.6.15/storage/innobase/lock/lock0lock.cc:6478
#8 0x0000000000999ab1 in lock_clust_rec_read_check_and_lock (flags=<value optimized out>, block=0x7fec37f04c00, rec=<value optimized out>, index=0x7fe8c801f238, offsets=<value optimized out>, mode=<value optimized out>, gap_mode=1024, thr=0x7fe89c014170) at /usr/src/debug/Percona-XtraDB-Cluster-5.6.15/storage/innobase/lock/lock0lock.cc:6820
#9 0x0000000000a0b4ff in sel_set_rec_lock (block=0x7fec37f04c00, rec=0x7fedb7178082 "", index=0x7fe8c801f238, offsets=0x7ff571ab9810, mode=<value optimized out>, type=<value optimized out>, thr=0x7fe89c014170) at /usr/src/debug/Percona-XtraDB-Cluster-5.6.15/storage/innobase/row/row0sel.cc:1012
#10 0x0000000000a11679 in row_search_for_mysql (buf=0x7fe89c0127b0 "\376\rafrica\\dariogusv", mode=1, prebuilt=0x7fe89c013a78, match_mode=<value optimized out>, direction=<value optimized out>) at /usr/src/debug/Percona-XtraDB-Cluster-5.6.15/storage/innobase/row/row0sel.cc:4506
#11 0x0000000000968857 in ha_innobase::general_fetch (this=0x7fe89c010a80, buf=0x7fe89c0127b0 "\376\rafrica\\dariogusv", direction=1, match_mode=0) at /usr/src/debug/Percona-XtraDB-Cluster-5.6.15/storage/innobase/handler/ha_innodb.cc:8969
#12 0x00000000005c087c in handler::ha_rnd_next (this=0x7fe89c010a80, buf=0x7fe89c0127b0 "\376\rafrica\\dar...

Ovais Tariq (ovais-tariq) wrote :

This was definitely not fixed in 5.6.15-25.5.

See my comment https://bugs.launchpad.net/codership-mysql/+bug/1264809/comments/6 which shows that I had a crash even after upgrading to 5.6.15-25.5

@Ovais,

I tested the latest PXC build (with galera 3.5, which was also not available when you tested) and was not able to crash the server.

I used a simplified form of https://github.com/codership/galera/blob/3.x/tests/regressions/lp1019473/run.sh here: https://gist.github.com/fc7f0a3a40c0823d8624

I used the binary equivalent of http://www.percona.com/downloads/TESTING/Percona-XtraDB-Cluster-56/5.6.15-25.5/5.6/771/ and http://www.percona.com/downloads/TESTING/Percona-XtraDB-Cluster-galera-56/galera-3.x/216/

I tested with upto 16 slave threads.

If the test doesn't fit the pattern that you were testing, we can open a
separate bug (since this bug is about regression of lp:1019473), and provide a
testcase if any.

Ovais Tariq (ovais-tariq) wrote :

The test case does fit the patter. Could galera 3.5 have anything to do with preventing the crash? And you are right, when I had tested galera 3.5 was not available.

Alex Yurchenko (ayurchen) wrote :

This is unlikely to do anything with Galera version. Could it be that you tested with wrong MySQL-wsrep or PXC release?

Ovais Tariq (ovais-tariq) wrote :

@Alex,

So initially the customer was running Percona XtraDB Cluster 5.6.15-25.4 when he hit this bug, this is before my comment #6. After Percona XtraDB Cluster 5.6.15-25.5 was released on 3/20, I upgraded the nodes to 5.6.15-25.5 and still had the crash.
Comment #6 and following comments are from crashes on 5.6.15-25.5.

This bug is being hit perhaps in a different way?
In comment #1 Seppo indicated that "if mysql would use full table scan, we would not hit this issue".
However, based on the crash that I had and the stack traces in comment #8 and #9, we can see that MySQL is doing a full table scan, but still hits this issue.

Unfortunately, I do not have the specific test case anymore. The customer added primary keys to the tables that were hit with the issue, so I won't be able to reproduce it anymore.

I suppose you might get some required information from the attached gdb back traces?

One more thing, the PXC that I used for regression test was slightly ahead of
25.5 release (with a few fixes for unique keys both in PXC and Galera), so you may want to give it a try, preferrably
on test environment (if any) wherein it was crashing earlier.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers