Secondary node crashing under heavily concurrent transactional stress

Bug #861901 reported by Patrick Crews
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Fix Released
Critical
Seppo Jaakola
5.1
Fix Committed
Critical
Seppo Jaakola
5.5
Fix Released
Critical
Seppo Jaakola

Bug Description

When issuing the following randgen workload against a 2-node setup:
./gentest.pl --gendata=conf/drizzle/translog_drizzle.zz --grammar=conf/drizzle/translog_concurrent1.yy --queries=700 --threads=100 --dsn=dbi:mysql:host=localhost:port=3306:user=root:password='':database=test --sqltrace --debug

It was found that the second node is crashing. This test is designed so that each thread produces 700 transactions of varying length and validity.

There is a modified randgen branch that was used for this test:
bzr branch lp:~patrick-crews/randgen/randgen_codership

Revision history for this message
Vadim Tkachenko (vadim-tk) wrote :

there is error log from 2nd node

110928 18:39:06 [ERROR] Slave SQL: Could not execute Update_rows event on table test.DD; Can't find record in 'DD', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 981, Error_code: 1032
110928 18:39:06 [Warning] WSREP: RBR event 2 apply warning: 120, 126062
110928 18:39:06 [ERROR] WSREP: Failed to apply trx: source: 608bc31a-ea21-11e0-0800-35fa1bfd4d40 version: 1 local: 0 state: CERTIFYING flags: 1 conn_id: 105 trx_id: 1136961 seqnos (l: 9097, g: 126062, s: 126061, d: 126055, ts: 1317249230968256296)
110928 18:39:06 [ERROR] WSREP: Failed to apply app buffer: Π�N
, seqno: 126062, status: WSREP_FATAL
         at galera/src/replicator_smm.cpp:apply_data():77
         at galera/src/replicator_smm.cpp:apply_trx_ws():184
110928 18:39:06 [ERROR] WSREP: Node consistency compromized, aborting...
110928 18:39:06 [Note] WSREP: Closing send monitor...
110928 18:39:06 [Note] WSREP: Closed send monitor.
110928 18:39:06 [Note] WSREP: gcomm: terminating thread
110928 18:39:06 [Note] WSREP: gcomm: joining thread
110928 18:39:06 [Note] WSREP: gcomm: closing backend
110928 18:39:06 [Note] WSREP: evs::proto(31bd488e-ea22-11e0-0800-a819c51eec43, LEAVING, view_id(REG,31bd488e-ea22-11e0-0800-a819c51eec43,2)) uuid 608bc31a-ea21-11e0-0800-35fa1bfd4d40 missing from install message, assuming partitioned
110928 18:39:06 [Note] WSREP: GMCast::handle_stable_view: view(view_id(NON_PRIM,31bd488e-ea22-11e0-0800-a819c51eec43,2) memb {
        31bd488e-ea22-11e0-0800-a819c51eec43,
} joined {
} left {
} partitioned {
        608bc31a-ea21-11e0-0800-35fa1bfd4d40,
})
110928 18:39:06 [Note] WSREP: GMCast::handle_stable_view: view((empty))
110928 18:39:06 [Note] WSREP: New COMPONENT: primary = no, my_idx = 0, memb_num = 1
110928 18:39:06 [Note] WSREP: gcomm: closed
110928 18:39:06 [Note] WSREP: Flow-control interval: [8, 16]
110928 18:39:06 [Note] WSREP: Received NON-PRIMARY.
110928 18:39:06 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 126062)
110928 18:39:06 [Note] WSREP: Received self-leave message.
110928 18:39:06 [Note] WSREP: Flow-control interval: [0, 0]
110928 18:39:06 [Note] WSREP: Received SELF-LEAVE. Closing connection.
110928 18:39:06 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 126062)
110928 18:39:06 [Note] WSREP: RECV thread exiting 0: Success
110928 18:39:06 [Note] WSREP: recv_thread() joined.
110928 18:39:06 [Note] WSREP: Closing slave action queue.
110928 18:39:06 [Note] WSREP: bin/mysqld: Terminated.

Changed in codership-mysql:
assignee: nobody → Seppo Jaakola (seppo-jaakola)
importance: Undecided → Critical
milestone: none → 21.1
Changed in codership-mysql:
status: New → Confirmed
Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

This test uses: 'ROLLBACK TO SAVEPOINT' statements on regular basis. When I changed the test grammar so that such ROLLBACKS are not issued, the test was passing successfully.

Looks like the smoking gun lies somewhere in the dark dungeons of rollback processing.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

The problem seems to happen also with regular ROLLBACK statement. This randgen test issues both ROLLBACK and ROLLBACK TO SAVEPOINT statements.

Changed in codership-mysql:
status: Confirmed → In Progress
Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Change set #3570 (http://bazaar.launchpad.net/~codership/codership-mysql/wsrep-5.5/revision/3570) contains a fix for ROLLBACK statements only.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Fix for the remaining ROLLBACK TO SAVEPOINT statement support is in change set #3572 (http://bazaar.launchpad.net/~codership/codership-mysql/wsrep-5.5/revision/3572)

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

judging by related branches it was committed to 5.1 as well

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.