Crash-Resistant Replication feature working wrong

Bug #937852 reported by Aleksandr Khomyakov
22
This bug affects 4 people
Affects Status Importance Assigned to Milestone
Percona Server moved to https://jira.percona.com/projects/PS
Invalid
High
Unassigned
5.1
Won't Fix
High
Unassigned
5.5
Incomplete
High
Unassigned
5.6
Invalid
High
Unassigned

Bug Description

After crash recovery with option innodb_overwrite_relay_log_info, replication show raight exec_position, but after start position jump back to value from master.info file.

i write about this strange behaviour in forum with well formating http://forum.percona.com/index.php?t=msg&th=2116&start=0
repeat this same text

What i do?

checking with percona server 5.5.20 and 5.1.59. both case running replication capacity about 100 qps.

set safe variables

sync_binlog = 1
innodb_flush_log_at_trx_commit = 1
innodb_overwrite_relay_log_info = 1 or innodb_recovery_update_relay_log = 1
skip-slave-start

need anything else for this test?

i do emulation power failure over SysRq-B.

echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger

after server rebooted, see info files.

[root@db40 airo]# cat /local/mysql/data/master.info
15
log-bin.013888
868868572
10.x.xx.xx
replication
replication
3306
60
0

0

[root@db40 airo]# cat /local/mysql/data/relay-log.info
/local/mysql/relaylog/relay-bin.000687
868868715
log-bin.013888
868868572
2

then start mysql server, seen log after recovery

InnoDB: relay-log.info is detected.
InnoDB: relay log: position 868868715, file name /local/mysql/relaylog/relay-bin.000687
InnoDB: master log: position 868868572, file name log-bin.013888
InnoDB: The InnoDB memory heap is disabled

..........

InnoDB: Apply batch completed
InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 869698571, file name log-bin.013888
InnoDB: and relay log file
InnoDB: position 0 869698714, file name /local/mysql/relaylog/relay-bin.000687
InnoDB: Last MySQL binlog file position 0 1071255772, file name /local/mysql/binlog/log-bin.000404
120221 13:15:20 InnoDB: Restoring buffer pool pages from ib_lru_dump
120221 13:15:20 Percona XtraDB (http://www.percona.com) 1.0.17-12.5 started; log sequence number 483548337856
InnoDB: relay-log.info was overwritten.
120221 13:15:21 [Note] Recovering after a crash using /local/mysql/binlog/log-bin
120221 13:17:19 [Note] Starting crash recovery...
120221 13:17:19 [Note] Crash recovery finished.

recovery done is fine, replication not started because set option skip this action.

10.8.60.40-(none)-13:19:13> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 10.x.xx.xx
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: log-bin.013888
          Read_Master_Log_Pos: 868868572
               Relay_Log_File: relay-bin.000687
                Relay_Log_Pos: 869698714
        Relay_Master_Log_File: log-bin.013888
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 869698571
              Relay_Log_Space: 864055592
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
1 row in set (0.00 sec)

Can be seen that Exec_Master_Log_Pos get from binlog, and differs from master.info. from binlog is bigger pos, than master.info and this position is right.

Okey, everything looks great.

Start replication. And get error - Duplicate entry.

10.8.60.40-(none)-13:21:04> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.x.xx.xx
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: log-bin.013888
          Read_Master_Log_Pos: 988421748
               Relay_Log_File: relay-bin.000689
                Relay_Log_Pos: 249
        Relay_Master_Log_File: log-bin.013888
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1062
                   Last_Error: Error 'Duplicate entry '15043868' for key 'PRIMARY'' on query. Default database: .. [skip this part]
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 868868572
              Relay_Log_Space: 119553574
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1062
               Last_SQL_Error: Error 'Duplicate entry '15043868' for key 'PRIMARY'' on query. Default database: .. [skip this part]

We see that Exec position jump back from 869698571 to 868868572. This position replication get from mysql.info.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

It seems that there are several potential issues with the crash-resistant replication. One way to fail 5.1 does not involve the case where slave-relay.info is overwritten:

1) Add a crash injection site at trx_commit_off_kernel. This will trigger during the XA 2PC
commit protocol in the COMMIT phase.
2) Replicate an event from master to slave that will trigger this crash.
3) At the time of the crash the relay log master log position will point to the crashed
prepared transaction at position X, relay log pos will point to Y, InnoDB transactional fields
will point to the same master log position and relay log position Z, Z < Y.
4) On the InnoDB crash recovery InnoDB will undo the prepared transaction.
5) On the binlog crash recovery InnoDB will redo and commit the prepared transaction.
6) The slave will attempt to start replication assuming position X for the master log and
position Y for the relay log.
7) Thus it will attempt to re-execute the transaction that was committed in 5)

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Since the failures are different (and might require separate fixes), splitting out the XA issue to bug 1012715.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Confirmed the revert to the old master binlog position.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Aleksandr -

Did you experience the issue with the slave running on bare box, or did you use virtualization?

Revision history for this message
Aleksandr Khomyakov (toairo) wrote :

Hello, Laurynas.
thanks a lot for working under this bug.

for testing slave server i used usual server, profile in attach (pt-summary)

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Hi Aleksandr -

Thanks for the update. Some of the virtualization solutions do not honor file sync requests, we had to rule out that.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Bug 1015989 is another spin-off of this one.

Revision history for this message
yinfeng (yinfeng-zwx) wrote :

this bug(http://bugs.mysql.com/bug.php?id=66079) may also make innodb_overwrite_relay_log_info working wrong.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Thanks for the heads-up on the upstream 66079. I think it this is not a correctness but a performance bug and it does not affect the crash-resistant replication in any way, because the latter does not share this code path.

Revision history for this message
yinfeng (yinfeng-zwx) wrote :

Yes, you are right.
I was too hasty :(

tags: added: crash-resistant-slave-5.5
Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Everybody affected by this bug -

Can you still reproduce it, or have there been any past incidents, with --relay_log_recovery set to ON? The requirement for this setting has not been documented in our crash-resistant slave docs.

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-542

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.