MySQL patches by Codership

Bug #502559
Comment #1

Seppo Jaakola (seppo-jaakola) wrote on 2010-01-03:

Log analysis reveals the following:

100103 0:22:25 [Note] WSREP: BF kill, with seqno: 312545
100103 0:22:25 [Note] WSREP: kill trx QUERY_COMMITTING for 31279080

1. The slave applier with seqno 312545 needs to abort a local trx that is in the committing phase.

100103 0:22:25 [Note] [Debug] WSREP: mm_galera.c:mm_galera_pre_commit():1945: interrupted in commit queue for 312547

2. The victim has already replicated and is in the commit queue; the victim's seqno is 312547.

100103 0:22:25 [Note] [Debug] WSREP: mm_galera.c:mm_galera_replay_trx():2332: replaying applier in to grab: 1, seqno: 27609851 312547

3. The victim starts replaying and enters the commit queue again.

100103 0:22:25 [Note] [Debug] WSREP: mm_galera.c:process_query_write_set():976: remote trx seqno: 27609852 312548 last_seen_trx: 27609851 27609852, cert: 0
100103 0:22:25 [Note] [Debug] WSREP: galera_job_queue.c:job_queue_start_job():162: job: 0 starting

4. A new WS arrives with seqno 312548. Certification is OK and the slave applier starts applying it. Note that the replaying victim has already passed the cert queue, so this later remote WS can proceed. Note also that this remote trx has seen our victim trx and therefore does not conflict with it.
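To illustrate why 312548 passes certification, here is a minimal sketch of a last-seen based certification test. It is not the Galera implementation: the `WriteSet` class, the `certifies` function, the key string "sbtest:42" and the victim's `last_seen` value are all hypothetical, and only global seqnos are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSet:
    seqno: int        # global seqno assigned at replication
    last_seen: int    # highest global seqno the originating node had seen
    keys: frozenset   # row keys touched (hypothetical representation)


def certifies(candidate, earlier):
    """Return True if candidate passes certification against earlier write sets."""
    for ws in earlier:
        # Only write sets that the candidate's origin had NOT yet seen
        # can conflict with it; anything it has seen is already accounted for.
        if ws.seqno > candidate.last_seen and ws.keys & candidate.keys:
            return False
    return True


# The victim (312547) and a remote WS (312548) that touch the same row.
victim = WriteSet(seqno=312547, last_seen=312546, keys=frozenset({"sbtest:42"}))
remote = WriteSet(seqno=312548, last_seen=312547, keys=frozenset({"sbtest:42"}))

# Same remote WS, but from a node that had not yet seen the victim.
unseen = WriteSet(seqno=312548, last_seen=312546, keys=frozenset({"sbtest:42"}))

remote_ok = certifies(remote, [victim])   # True: remote has seen the victim
unseen_ok = certifies(unseen, [victim])   # False: overlapping keys, victim unseen
```

Passing this test only says the two write sets do not conflict logically. It still assumes that they are applied in seqno order.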

100103 0:22:25 [Note] [Debug] WSREP: mm_galera.c:ws_conflict_check():183: conflict check for: job1: 312547 type: 1 job2: 312548 type 0
100103 0:22:25 [Note] [Debug] WSREP: galera_job_queue.c:job_queue_start_job():162: job: 1 starting

5. 312546 must have left commit TO (although this is not visible here), so the replaying victim can grab commit TO. The job queue now has both our victim and slave applier 312548 running.

100103 0:22:25 [ERROR] Slave SQL: Could not execute Update_rows event on table test.sbtest; Can't find record in 'sbtest', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 204, Error_code: 1032

6. ...and then this error. It means that 312548 has deleted the row that 312547 tries to update. This can happen because 312548 has seen the effects of 312547 and therefore passes certification.
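The ordering problem can be reproduced with a toy model. This is only a sketch under assumptions: the table is a plain dict, row id 42 and the helper names are hypothetical, and the update/delete pair follows the conclusion above rather than the actual write set contents.

```python
class KeyNotFound(Exception):
    """Stands in for HA_ERR_KEY_NOT_FOUND (error 1032)."""


def apply_update(table, key, value):
    if key not in table:
        raise KeyNotFound(key)
    table[key] = value


def apply_delete(table, key):
    if key not in table:
        raise KeyNotFound(key)
    del table[key]


def run(order):
    """Apply the two write sets in the given seqno order on a fresh table."""
    table = {42: "old"}  # hypothetical row in test.sbtest
    ops = {
        312547: lambda t: apply_update(t, 42, "new"),  # replaying victim
        312548: lambda t: apply_delete(t, 42),         # remote WS, has seen 312547
    }
    try:
        for seqno in order:
            ops[seqno](table)
    except KeyNotFound:
        return "HA_ERR_KEY_NOT_FOUND"
    return "ok"


in_order = run([312547, 312548])      # "ok": update first, then delete
out_of_order = run([312548, 312547])  # "HA_ERR_KEY_NOT_FOUND": row already gone
```

In seqno order the update finds its row and the delete follows. When the applier for 312548 gets ahead of the replaying victim, the update fails in the same way as the logged error.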