Slave stops replaying relay log with no errors

Bug #1245141 reported by Bogdan
Affects: Percona Server (moved to https://jira.percona.com/projects/PS)
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

Percona Server 5.5.33-rel31.1.566.rhel6.x86_64 on CentOS 6 with all updates installed.
Replication has broken three times: at the beginning of October, yesterday, and today.
So I decided to gather info and report the bug.

The issue is that the slave does not replay its relay logs: it downloads binlogs from the master, but the data in relay-log.info is not updated and Seconds_Behind_Master keeps growing.
After restarting the slave via 'mysqladmin shutdown', I see "Seconds_Behind_Master: 0" in the slave status; Exec_Master_Log_Pos and Relay_Log_File are the same as before the restart, but events continue to arrive in new relay logs. So I think the 'event transfer' (I/O) thread is OK, but the 'event replay' (SQL) thread hangs for some reason. MySQL uses 100% CPU according to 'top', so I think something goes wrong in the replay thread.
The 'master.info' file is updated correctly both before and after the restart.
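
For reference, the state of both replication threads during such a hang can be checked with the standard commands (stock MySQL 5.5, nothing Percona-specific assumed):

-- Is the SQL (replay) thread running, and do the exec positions advance?
SHOW SLAVE STATUS\G
-- The State column of the replication SQL thread shows what it is doing.
SHOW PROCESSLIST;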

I can privately provide relay logs from the slave and binlogs from the master.
The replication format is 'Mixed'.

According to atop, the CPU consumer is thread ID 16414, so I attached to it with gdb and gathered some backtraces.

By the way, while I was investigating the issue and gathering debug info (one or two hours after the restart), Seconds_Behind_Master reached a reasonable value (25746) and continued to grow.

I would like to gather more info next time, so please provide instructions.
Direct access to the system may be granted upon request.

Thank you in advance!

Revision history for this message
Bogdan (bogdar) wrote :

Hello!
I have reproduced this bug again on Oct 28, Oct 29, and today. Additionally, I updated Percona Server to 55-5.5.34-rel32.0.591.rhel6.x86_64 and performed 'reset slave' and 'change master', and the system still hangs on some statement. I also noticed that I globally switched from 'repeatable read' to 'read committed' a few days ago, and since then replication never runs for more than 20 hours, so I guess this may be caused by an increased number of 'row' events in replication.
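
That guess is consistent with how mixed-format logging works: under READ COMMITTED, InnoDB must log DML as row events, so their volume grows. Both settings can be verified with a quick check (a minimal sketch, using the 5.5 variable names):

-- Global isolation level and binlog format (MySQL 5.5 names)
SELECT @@global.tx_isolation, @@global.binlog_format;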

Revision history for this message
Valerii Kravchuk (valerii-kravchuk) wrote :

Can you please check whether you have any tables without PRIMARY/UNIQUE keys on the slave? Run a query like this:

select table_schema, table_name
from information_schema.columns
where table_schema not in ('mysql', 'information_schema', 'performance_schema')
group by table_schema,table_name
having sum(if(column_key in ('PRI','UNI'), 1,0)) = 0;

and check whether any tables are found and whether they are large. Note that this query may run for a long time if you have many tables in your databases.
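
A possible follow-up to estimate the size of any keyless tables found (a sketch built on the query above; table_rows and data_length in information_schema.tables are only estimates for InnoDB):

select t.table_schema, t.table_name, t.table_rows, t.data_length
from information_schema.tables as t
join (
    select table_schema, table_name
    from information_schema.columns
    where table_schema not in ('mysql', 'information_schema', 'performance_schema')
    group by table_schema, table_name
    having sum(if(column_key in ('PRI','UNI'), 1, 0)) = 0
) as keyless
  on keyless.table_schema = t.table_schema
 and keyless.table_name = t.table_name
order by t.data_length desc;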

Changed in percona-server:
status: New → Incomplete
Revision history for this message
Bogdan Rudas (bogdar-b) wrote :

Hello!

Yes, we have some silly GeoIP tables in the DB, and yes, replication breaks at 04:30 when we start updating the GeoIP tables with queries like:

INSERT IGNORE INTO `Geo_IP` VALUES
INSERT INTO `Geo_MaxMind_IP` VALUES {$values}

UPDATE `Geo_IP` SET `Deleted` = 1 WHERE `Location_ID` IN #a AND `Source` IN (...)
DELETE FROM `Geo_IP` WHERE `Location_ID` IN #a AND `Deleted` = 1

INSERT IGNORE INTO `Geo_IP` (
    SELECT `mid`.`Location_ID`,
           `mip`.`Interval_From` as `from`,
           `mip`.`Interval_To` as `to`,
           `mip`.`Interval_Index` as `index`,
           'maxmind', 0
    FROM `Geo_MaxMind_ID` AS `mid`
    INNER JOIN `Geo_MaxMind_IP` AS `mip` ON `mip`.`MaxMind_ID` = `mid`.`ID`
    WHERE `mid`.`Location_ID` IN #a)

Geo_IP and Geo_MaxMind_IP hold about 2M records; the data files are 165 MB and 100 MB.

I will disable the GeoIP updates and check whether replication survives the night; then I will manually start a GeoIP update to confirm that it breaks replication.
I can perform some other tests if necessary.

Revision history for this message
Bogdan Rudas (bogdar-b) wrote :

Thank you for the tip: after creating primary keys on some tables, replication works well.
However, I can drop the primary keys and reproduce the bug again if it is necessary to gather more info about the bug.
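
For the record, the change was along these lines; the key column here is purely illustrative, since the real key depends on the table (a hypothetical sketch):

-- Hypothetical example: give Geo_IP a surrogate primary key so the slave
-- can locate rows by index instead of scanning the table for every row event.
ALTER TABLE `Geo_IP`
    ADD COLUMN `ID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    ADD PRIMARY KEY (`ID`);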

Revision history for this message
Valerii Kravchuk (valerii-kravchuk) wrote :

Then it seems to be a known upstream bug: http://bugs.mysql.com/bug.php?id=53375.

The bug was formally "fixed" by adding a message to the error log, but the real fix came in MySQL 5.6 (and, as a result, in Percona Server 5.6). Read http://geek.rohitkalhans.com/2012/08/Batch-operations-in-RBR.html for some details.
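
For reference, the 5.6 fix is exposed through the slave_rows_search_algorithms variable: enabling hash scans lets the SQL thread apply a whole batch of row events with a single table scan instead of one scan per row. A sketch (valid on 5.6+ only):

-- MySQL/Percona Server 5.6+: allow hash scans when applying row events
SET GLOBAL slave_rows_search_algorithms = 'INDEX_SCAN,HASH_SCAN';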

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PS-3058
