Attempt to match against an empty key | Crash in UNIV_DEBUG

Bug #1274199 reported by Jay Janssen on 2014-01-29
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Galera
Undecided
Alex Yurchenko
Percona XtraDB Cluster
Status tracked in 5.6
5.5
Undecided
Unassigned
5.6
Undecided
Unassigned

Bug Description

[root@ip-10-29-254-10 mysql]# rpm -qa | grep -i percona
Percona-XtraDB-Cluster-shared-55-5.5.34-25.9.607.rhel6.x86_64
Percona-XtraDB-Cluster-client-55-5.5.34-25.9.607.rhel6.x86_64
Percona-XtraDB-Cluster-galera-3-3.2-1.196.rhel6.x86_64
percona-toolkit-2.2.6-1.noarch
Percona-Server-shared-51-5.1.73-rel14.11.603.rhel6.x86_64
percona-xtrabackup-2.1.7-721.rhel6.x86_64
Percona-XtraDB-Cluster-server-55-5.5.34-25.9.607.rhel6.x86_64

Scenario is 2 node cluster.

node2 is wsrep_desync=ON and FLUSH TABLES WITH READ LOCK;

node1 runs: sysbench --db-driver=mysql --test=sysbench_tests/db/update_non_index.lua --rand-type=uniform --rand-init --mysql-user=test --mysql-password=test --mysql-db=test --oltp-table-size=250000 --max-requests=1000000 --num-threads=8 --max-time=0 --report-interval=1 run

Once sysbench finishes on node1, execute UNLOCK TABLES on node2. Node2 gets this error:

Version: '5.5.34-55' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), wsrep_25.9.r3928
140129 15:25:09 [Note] WSREP: Service thread queue flushed.
140129 15:25:09 [Note] WSREP: Synchronized with group, ready for connections
140129 15:25:09 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
140129 15:25:22 [Note] WSREP: Node 1 (ip-10-29-254-10) desyncs itself from group
140129 15:25:22 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 2424911)
140129 15:26:36 [Note] WSREP: Provider paused at 3f186ea4-88ee-11e3-b496-2271cf23d665:2524911
140129 15:27:16 [Note] WSREP: Provider resumed.
140129 16:52:06 [Note] WSREP: Provider paused at 3f186ea4-88ee-11e3-b496-2271cf23d665:2624911
140129 16:53:18 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 134217728 bytes
140129 16:54:26 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000001 of size 134217728 bytes
140129 16:55:35 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000002 of size 134217728 bytes
140129 16:56:42 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000003 of size 134217728 bytes
140129 17:07:50 [Note] WSREP: Provider resumed.
140129 17:07:51 [ERROR] WSREP: Attempt to match against an empty key (1,0): 22 (Invalid argument)
         at galera/src/key_set.cpp:throw_match_empty_key():135
140129 17:07:51 [Note] WSREP: applier thread exiting (code:7)
140129 17:07:51 [ERROR] WSREP: node consistency compromised, aborting
140129 17:07:51 [Note] WSREP: starting shutdown
140129 17:07:51 [Note] /usr/sbin/mysqld: Normal shutdown

140129 17:07:51 [Note] WSREP: Stop replication
140129 17:07:51 [Note] WSREP: Closing send monitor...
140129 17:07:51 [Note] WSREP: Closed send monitor.
140129 17:07:51 [Note] WSREP: gcomm: terminating thread
140129 17:07:51 [Note] WSREP: gcomm: joining thread
140129 17:07:51 [Note] WSREP: gcomm: closing backend
140129 17:07:52 [Note] WSREP: view(view_id(NON_PRIM,825f9682-88f9-11e3-9ed8-b732f283d4db,2) memb {
        887a4798-88f9-11e3-9c26-2ecbbc9449b7,0
} joined {
} left {
} partitioned {
        825f9682-88f9-11e3-9ed8-b732f283d4db,0
})
140129 17:07:52 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
140129 17:07:52 [Note] WSREP: view((empty))
140129 17:07:52 [Note] WSREP: gcomm: closed
140129 17:07:52 [Note] WSREP: Flow-control interval: [1000000, 1000000]
140129 17:07:52 [Note] WSREP: Received NON-PRIMARY.
140129 17:07:52 [Note] WSREP: Shifting DONOR/DESYNCED -> OPEN (TO: 3624911)
140129 17:07:52 [Note] WSREP: Received self-leave message.
140129 17:07:52 [Note] WSREP: Flow-control interval: [0, 0]
140129 17:07:52 [Note] WSREP: Received SELF-LEAVE. Closing connection.
140129 17:07:52 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 3624911)
140129 17:07:52 [Note] WSREP: RECV thread exiting 0: Success
140129 17:07:52 [Note] WSREP: recv_thread() joined.
140129 17:07:52 [Note] WSREP: Closing replication queue.
140129 17:07:52 [Note] WSREP: Closing slave action queue.
140129 17:07:54 [Note] WSREP: killing local connection: 18

Note that node2 is also hung at this point: mysqld is there, but it does not accept connections. I'm unsure if this will happen consistently or not.

Related branches

Tested this with 5.6 and sysbench, couldn't replicate it there. Need to check with 5.5.

@Jay,
One more thing, I see 5.5 has been used with 3.x, were you able to replicate this with 5.5 + 2.x as well?

Alex Yurchenko (ayurchen) wrote :

Raghu, this is exclusively Galera-3 issue.

Download full text (4.3 KiB)

Actually with UNIV_DEBUG build it crashes inside galera: (ie. 5.5 with galera3)

(gdb) bt
#0 0x00007ffff637c389 in raise () from /usr/lib/libc.so.6
#1 0x00007ffff637d788 in abort () from /usr/lib/libc.so.6
#2 0x00007ffff5486265 in base_size (buf=<optimized out>, size=4294967295, ver=<optimized out>) at galera/src/key_set.hpp:267
#3 serial_size (size=4294967295, buf=<optimized out>, ver=<optimized out>) at galera/src/key_set.hpp:282
#4 serial_size (size=<optimized out>, buf=<optimized out>) at galera/src/key_set.hpp:208
#5 serial_size (this=<optimized out>) at galera/src/key_set.hpp:212
#6 next_base<galera::KeySet::KeyPart> (this=0x7fff886d9e68) at galerautils/src/gu_rset.hpp:346
#7 next (this=0x7fff886d9e68) at galerautils/src/gu_rset.hpp:420
#8 next (this=0x7fff886d9e68) at galera/src/key_set.hpp:713
#9 galera::Certification::purge_for_trx_v3 (this=0x11ccd18, trx=0x7fff886d9cd0) at galera/src/certification.cpp:120
#10 0x00007ffff5486315 in galera::Certification::purge_for_trx (this=<optimized out>, trx=trx@entry=0x7fff886d9cd0) at galera/src/certification.cpp:147
#11 0x00007ffff548c982 in galera::Certification::PurgeAndDiscard::operator() (this=this@entry=0x7fffe410c630, vt=...) at galera/src/certification.hpp:140
#12 0x00007ffff54863ac in for_each<std::_Rb_tree_iterator<std::pair<long const, galera::TrxHandle*> >, galera::Certification::PurgeAndDiscard> (__f=..., __last=..., __first=...)
    at /usr/include/c++/4.8.2/bits/stl_algo.h:4417
#13 galera::Certification::purge_trxs_upto_ (this=this@entry=0x11ccd18, seqno=seqno@entry=119040, handle_gcache=handle_gcache@entry=true) at galera/src/certification.cpp:937
#14 0x00007ffff548b65d in galera::Certification::append_trx (this=this@entry=0x11ccd18, trx=trx@entry=0x7fff89b82740) at galera/src/certification.cpp:1003
#15 0x00007ffff54b5189 in galera::ReplicatorSMM::cert (this=this@entry=0x11cc480, trx=trx@entry=0x7fff89b82740) at galera/src/replicator_smm.cpp:1667
#16 0x00007ffff54b567b in galera::ReplicatorSMM::pre_commit (this=0x11cc480, trx=0x7fff89b82740, meta=<optimized out>) at galera/src/replicator_smm.cpp:715
#17 0x00007ffff54c51c0 in galera_pre_commit (gh=<optimized out>, conn_id=22, trx_handle=<optimized out>, flags=<optimized out>, meta=0x18f6758) at galera/src/wsrep_provider.cpp:505
#18 0x000000000066474a in wsrep_run_wsrep_commit (thd=thd@entry=0x18f4b20, hton=hton@entry=0x11f5250, all=all@entry=true) at /media/Oort/ncode/percona-xtradb-cluster/trunk-25/sql/wsrep_hton.cc:429
#19 0x0000000000665467 in wsrep_prepare (hton=0x11f5250, thd=0x18f4b20, all=<optimized out>) at /media/Oort/ncode/percona-xtradb-cluster/trunk-25/sql/wsrep_hton.cc:163
#20 0x00000000006bb7cf in ha_commit_trans (thd=thd@entry=0x18f4b20, all=all@entry=true) at /media/Oort/ncode/percona-xtradb-cluster/trunk-25/sql/handler.cc:1384
#21 0x000000000064d6a8 in trans_commit (thd=thd@entry=0x18f4b20) at /media/Oort/ncode/percona-xtradb-cluster/trunk-25/sql/transaction.cc:199
#22 0x00000000005902cc in mysql_execute_command (thd=thd@entry=0x18f4b20) at /media/Oort/ncode/percona-xtradb-cluster/trunk-25/sql/sql_parse.cc:4371
#23 0x0000000000593471 in mysql_parse (thd=thd@entry=0x18f4b20, rawbuf=rawbuf@entry=0x7fff8802...

Read more...

bt full and thread apply all bt full.

It crashes here:

        static size_t
        base_size (Version const ver,
                   const gu::byte_t* const buf, size_t const size)
        {
            switch (ver)
            {
            case FLAT16:
            case FLAT16A:
                return 16;
            case FLAT8:
            case FLAT8A:
                return 8;
            case EMPTY: assert(0);
            }

            abort();
        }

summary: - Attempt to match against an empty key
+ Attempt to match against an empty key | Crash in UNIV_DEBUG
Download full text (76.1 KiB)

With debug=1 + log_info that I added:

140308 22:17:06 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 203856
140308 22:17:06 [Note] [Debug] WSREP: gcs/src/gcs_group.c:gcs_group_handle_last_msg():629: New COMMIT CUT 203837 after 203856 from 1
140308 22:17:06 [Note] [Debug] WSREP: galera/src/replicator_smm.cpp:process_commit_cut():1250: Got commit cut from GCS: 203837
WSREP: Deadlock detected
WSREP: Deadlock detected
140308 22:17:06 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 203905
WSREP: Deadlock detected
140308 22:17:06 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 203954
WSREP: Deadlock detected
140308 22:17:07 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 204002
WSREP: Deadlock detected
140308 22:17:07 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 204054
WSREP: Deadlock detected
WSREP: Deadlock detected
140308 22:17:07 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 204090
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
140308 22:17:07 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 204137
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
140308 22:17:08 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 204188
140308 22:17:08 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 204218
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
140308 22:17:08 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 204268
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
140308 22:17:09 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 204317
WSREP: Deadlock detected
140308 22:17:09 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 204346
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
140308 22:17:09 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 204394
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
140308 22:17:09 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 204444
WSREP: Deadlock detected
WSREP: Deadlock detected
140308 22:17:09 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 204470
140308 22:17:09 [Note] [Debug] WSREP: galera/src/galera_service_thd.cpp:thd_func():67: Reported last committed: 204471
WSREP: Deadlock detected
WSREP: Deadlock detected
WSREP: Deadlock detected
140308 22:17:10 [Note]...

From log_info I added:

=== modified file 'galera/src/key_set.hpp'
--- galera/src/key_set.hpp 2013-12-01 16:29:42 +0000
+++ galera/src/key_set.hpp 2014-03-08 16:31:07 +0000
@@ -264,6 +264,7 @@
             case EMPTY: assert(0);
             }

+ log_info << "Version value:" << ver;
             abort();
         }

140308 22:18:46 [Note] [Debug] WSREP: galera/src/certification.cpp:append_trx():986: trx map size: 16451 - check if status.last_committed is incrementing
140308 22:18:46 [Note] WSREP: galera/src/key_set.hpp:base_size():267: Version value:0

Here:

             case FLAT8A:
                return 8;
            case EMPTY: assert(0);
            }

            abort();

Is it missing a 'return 0' here? unless you want to assert (when NDEBUG) and abort as well.

From node2 as per the reporter I see:

140308 22:28:32 [Note] WSREP: Skipping empty log_xid: (null)
140308 22:28:32 [Note] WSREP: Skipping empty log_xid: (null)
140308 22:28:32 [Note] WSREP: Skipping empty log_xid: (null)
140308 22:28:32 [ERROR] WSREP: Attempt to match against an empty key (1,0): 22 (Invalid argument)
         at galera/src/key_set.cpp:throw_match_empty_key():135
140308 22:28:32 [Note] WSREP: applier thread exiting (code:7)
140308 22:28:32 [ERROR] WSREP: node consistency compromised, aborting
140308 22:28:32 [Note] WSREP: starting shutdown
140308 22:28:32 [Note] WSREP: closing applier 7
140308 22:28:32 [Note] WSREP: wsrep running threads now: 8
140308 22:28:32 [Note] WSREP: Skipping empty log_xid: (null)
1

Download full text (21.5 KiB)

Adding 'return 0' :

On node1:
==============
140308 22:36:47 [ERROR] WSREP: galera/src/wsrep_provider.cpp:galera_pre_commit():513: Attempt to match against an empty key (1,0): 22 (Invalid argument)
         at galera/src/key_set.cpp:throw_match_empty_key():135
140308 22:36:47 [ERROR] WSREP: unknown connection failure
140308 22:38:48 [Note] [Debug] WSREP: gcomm/src/evs_proto.cpp:handle_gap():2968: evs::proto(667205c1-a6e3-11e3-bf0b-b683d5ad1a93, GATHER, view_id(REG,667205c1-a6e3-11e3-bf0b-b683d5ad1a93,2)) commit gap from 667205c1-a6e3-11e3-bf0b-b683d5ad1a93
140308 22:38:48 [Note] [Debug] WSREP: gcomm/src/evs_proto.cpp:handle_gap():3000: evs::proto(667205c1-a6e3-11e3-bf0b-b683d5ad1a93, INSTALL, view_id(REG,667205c1-a6e3-11e3-bf0b-b683d5ad1a93,2)): install gap evs::msg{version=0,type=3,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=0,source=667205c1-a6e3-11e3-bf0b-b683d5ad1a93,source_view_id=view_id(REG,667205c1-a6e3-11e3-bf0b-b683d5ad1a93,3),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=35686,node_list=()
}
140308 22:38:48 [Note] [Debug] WSREP: gcomm/src/pc_proto.cpp:handle_view():538: 667205c1-a6e3-11e3-bf0b-b683d5ad1a93 view(view_id(TRANS,667205c1-a6e3-11e3-bf0b-b683d5ad1a93,2) memb {
        667205c1-a6e3-11e3-bf0b-b683d5ad1a93,0
} joined {
} left {
        6ce84e85-a6e3-11e3-bfce-6f762c1fafc3,0
} partitioned {
})
140308 22:38:48 [Note] [Debug] WSREP: gcomm/src/pc_proto.cpp:handle_trans():451: 667205c1-a6e3-11e3-bf0b-b683d5ad1a93

 current view view(view_id(REG,667205c1-a6e3-11e3-bf0b-b683d5ad1a93,2) memb {
        667205c1-a6e3-11e3-bf0b-b683d5ad1a93,0
        6ce84e85-a6e3-11e3-bfce-6f762c1fafc3,0
} joined {
        6ce84e85-a6e3-11e3-bfce-6f762c1fafc3,0
} left {
} partitioned {
})

 next view view(view_id(TRANS,667205c1-a6e3-11e3-bf0b-b683d5ad1a93,2) memb {
        667205c1-a6e3-11e3-bf0b-b683d5ad1a93,0
} joined {
} left {
        6ce84e85-a6e3-11e3-bfce-6f762c1fafc3,0
} partitioned {
})

 pc view view(view_id(PRIM,667205c1-a6e3-11e3-bf0b-b683d5ad1a93,2) memb {
        667205c1-a6e3-11e3-bf0b-b683d5ad1a93,0
        6ce84e85-a6e3-11e3-bfce-6f762c1fafc3,0
} joined {
} left {
} partitioned {
})
140308 22:38:48 [Note] [Debug] WSREP: gcomm/src/pc_proto.cpp:handle_trans():483: 667205c1-a6e3-11e3-bf0b-b683d5ad1a93 quorum ok
140308 22:38:48 [Note] [Debug] WSREP: gcomm/src/pc_proto.cpp:shift_to():321: 667205c1-a6e3-11e3-bf0b-b683d5ad1a93 shift_to: PRIM -> TRANS prim 1 last prim view_id(PRIM,667205c1-a6e3-11e3-bf0b-b683d5ad1a93,2) to_seq 18696
140308 22:38:48 [Note] [Debug] WSREP: gcomm/src/evs_input_map2.cpp:clear():227: discarding 12 messages from recovery index
140308 22:38:48 [Note] [Debug] WSREP: gcomm/src/evs_input_map2.cpp:reset():153: size 0
140308 22:38:48 [Note] [Debug] WSREP: gcomm/src/evs_input_map2.cpp:reset():159: node: {idx=0,range=[0,-1],safe_seq=-1} size 1
140308 22:38:48 [Note] [Debug] WSREP: gcomm/src/gmcast.cpp:handle_stable_view():1439: GMCast::handle_stable_view: view(view_id(REG,667205c1-a6e3-11e3-bf0b-b683d5ad1a93,3) memb {
        667205c1-a6e3-11e3-bf0b-b683d5ad1a93,0
} joined {
} left {
        6ce84e85-a6e3-11e3-bfce-6f762c1fafc3,0
} partitioned {
})
140308 22:38:48 [Note] [Debug]...

Regarding previous comment,

 140308 22:36:47 [ERROR] WSREP: galera/src/wsrep_provider.cpp:galera_pre_commit():513: Attempt to match against an empty key (1,0): 22 (Invalid argument)
         at galera/src/key_set.cpp:throw_match_empty_key():135
140308 22:36:47 [ERROR] WSREP: unknown connection failure

After this the node1 hangs, probably due to queue being full, unlock tables on node2 continues it.

Alex Yurchenko (ayurchen) wrote :

Looks like it was off-by-one error in certification cleanup.
Fix pushed in http://bazaar.launchpad.net/~codership/galera/3.x/revision/178

Changed in galera:
assignee: nobody → Alex Yurchenko (ayurchen)
milestone: none → 25.3.5
status: New → Fix Committed
Changed in galera:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Bug attachments