Slave stops replaying relay log with no errors

Bug #1245141 reported by Bogdan
Affects: Percona Server (moved to https://jira.percona.com/projects/PS)
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

Percona Server 5.5.33-rel31.1.566.rhel6.x86_64 on CentOS 6 with all updates installed.
Replication has broken three times: at the beginning of October, yesterday, and today.
So I decided to gather info and report the bug.

The issue is that the slave does not replay its relay logs: it downloads binlogs from the master, but the data in relay-log.info is not updated and Seconds_Behind_Master keeps growing.
After restarting the slave via 'mysqladmin shutdown', I see "Seconds_Behind_Master: 0" in the slave status; Exec_Master_Log_Pos and Relay_Log_File are the same as before the restart, but events continue to arrive in new relay logs. So I think the 'event transfer' (I/O) thread is OK, but the 'event replay' (SQL) thread hangs for some reason. MySQL uses 100% CPU according to 'top', so I think something goes wrong in the replay thread.
The 'master.info' file is updated correctly both before and after the restart.
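
For reference, the state of both replication threads during such a hang can be checked with the standard commands (stock MySQL 5.5, nothing Percona-specific assumed):

-- Is the SQL (replay) thread running, and do the exec positions advance?
SHOW SLAVE STATUS\G
-- The State column of the replication SQL thread shows what it is doing.
SHOW PROCESSLIST;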

I can privately provide relay logs from the slave and binlogs from the master.
The replication format is 'Mixed'.

According to atop, the CPU consumer is thread ID 16414, so I attached to it with gdb and gathered some backtraces.

By the way, while I was investigating the issue and gathering debug info (one or two hours after the restart), Seconds_Behind_Master reached a reasonable value (25746) and continued to grow.

I would like to gather more info next time, so please provide instructions.
Direct access to the system may be granted upon request.

Thank you in advance!

Revision history for this message
Bogdan (bogdar) wrote :

Hello!
I have reproduced this bug again on Oct 28, Oct 29, and today. Additionally, I updated Percona Server to 55-5.5.34-rel32.0.591.rhel6.x86_64 and performed 'reset slave' and 'change master', and the system still hangs on some statement. I also noticed that I globally switched from 'repeatable read' to 'read committed' a few days ago, and since then replication never runs for more than 20 hours, so I guess this may be caused by an increased number of 'row' events in replication.
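
That guess is consistent with how mixed-format logging works: under READ COMMITTED, InnoDB must log DML as row events, so their volume grows. Both settings can be verified with a quick check (a minimal sketch, using the 5.5 variable names):

-- Global isolation level and binlog format (MySQL 5.5 names)
SELECT @@global.tx_isolation, @@global.binlog_format;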

Revision history for this message
Valerii Kravchuk (valerii-kravchuk) wrote :

Can you please check whether you have any tables without PRIMARY/UNIQUE keys on the slave? Run a query like this:

select table_schema, table_name
from information_schema.columns
where table_schema not in ('mysql', 'information_schema', 'performance_schema')
group by table_schema,table_name
having sum(if(column_key in ('PRI','UNI'), 1,0)) = 0;

and check whether any tables are found and whether they are large. Note that this query may run for a long time if you have many tables in your databases.
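
A possible follow-up to estimate the size of any keyless tables found (a sketch built on the query above; table_rows and data_length in information_schema.tables are only estimates for InnoDB):

select t.table_schema, t.table_name, t.table_rows, t.data_length
from information_schema.tables as t
join (
    select table_schema, table_name
    from information_schema.columns
    where table_schema not in ('mysql', 'information_schema', 'performance_schema')
    group by table_schema, table_name
    having sum(if(column_key in ('PRI','UNI'), 1, 0)) = 0
) as keyless
  on keyless.table_schema = t.table_schema
 and keyless.table_name = t.table_name
order by t.data_length desc;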

Changed in percona-server:
status: New → Incomplete
Revision history for this message
Bogdan Rudas (bogdar-b) wrote :

Hello!

Yes, we have some silly GeoIP tables in the DB, and yes, replication breaks at 04:30 when we start updating the GeoIP tables with queries like:

INSERT IGNORE INTO `Geo_IP` VALUES
INSERT INTO `Geo_MaxMind_IP` VALUES {$values}

UPDATE `Geo_IP` SET `Deleted` = 1 WHERE `Location_ID` IN #a AND `Source` IN (...)
DELETE FROM `Geo_IP` WHERE `Location_ID` IN #a AND `Deleted` = 1

INSERT IGNORE INTO `Geo_IP` (
    SELECT `mid`.`Location_ID`,
           `mip`.`Interval_From` as `from`,
           `mip`.`Interval_To` as `to`,
           `mip`.`Interval_Index` as `index`,
           'maxmind', 0
    FROM `Geo_MaxMind_ID` AS `mid`
    INNER JOIN `Geo_MaxMind_IP` AS `mip` ON `mip`.`MaxMind_ID` = `mid`.`ID`
    WHERE `mid`.`Location_ID` IN #a)

Geo_IP and Geo_MaxMind_IP hold about 2M records; the data files are 165 MB and 100 MB.

I will disable the GeoIP updates and check whether replication survives the night; then I will manually start a GeoIP update to confirm that it breaks replication.
I can perform some other tests if necessary.

Revision history for this message
Bogdan Rudas (bogdar-b) wrote :

Thank you for the tip: after creating primary keys on some tables, replication works well.
However, I can drop the primary keys and reproduce the bug again if it is necessary to gather more info about the bug.
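
For the record, the change was along these lines; the key column here is purely illustrative, since the real key depends on the table (a hypothetical sketch):

-- Hypothetical example: give Geo_IP a surrogate primary key so the slave
-- can locate rows by index instead of scanning the table for every row event.
ALTER TABLE `Geo_IP`
    ADD COLUMN `ID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    ADD PRIMARY KEY (`ID`);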

Revision history for this message
Valerii Kravchuk (valerii-kravchuk) wrote :

Then it seems to be a known upstream bug: http://bugs.mysql.com/bug.php?id=53375.

The bug was formally "fixed" by adding a message to the error log, but the real fix came in MySQL 5.6 (and, as a result, in Percona Server 5.6). Read http://geek.rohitkalhans.com/2012/08/Batch-operations-in-RBR.html for some details.
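
For reference, the 5.6 fix is exposed through the slave_rows_search_algorithms variable: enabling hash scans lets the SQL thread apply a whole batch of row events with a single table scan instead of one scan per row. A sketch (valid on 5.6+ only):

-- MySQL/Percona Server 5.6+: allow hash scans when applying row events
SET GLOBAL slave_rows_search_algorithms = 'INDEX_SCAN,HASH_SCAN';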

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PS-3058
