slave IO_THREAD blocks replication - sql thread

Bug #1057087 reported by MF
This bug affects 2 people
Affects / Status / Importance / Assigned to / Milestone:
  MySQL Server: Unknown / Unknown
  Percona Server (moved to https://jira.percona.com/projects/PS): Invalid / Undecided / Unassigned
  5.1: Won't Fix / Medium / Unassigned
  5.5: Triaged / Medium / Unassigned
  5.6: Invalid / Undecided / Unassigned

Bug Description

After a snapshot with XtraBackup, the slave was restored with a logical lag of about 6 hours. Replication started with no errors. However, the lag (measured by Seconds_Behind_Master, Relay_Master_Log_File and other variables from SHOW SLAVE STATUS) kept growing by roughly 30 minutes every hour (this is the core of the reported problem). All master binlogs were transferred by the IO thread immediately (1 Gbit LAN). One day later the lag was more than 24 hours. After
STOP SLAVE IO_THREAD;
the slave was up to date in less than 1 hour (with Seconds_Behind_Master: 0).
Then the IO_THREAD was started again and the slave has now been running for about one day without any lag (there is monitoring in place).
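
For clarity, the workaround sequence described above is, as a minimal sketch (run on the slave, with the catch-up monitored manually as in the report):

    -- Workaround as described in this report (illustrative sketch).
    STOP SLAVE IO_THREAD;    -- stop fetching new binlog events from the master
    SHOW SLAVE STATUS\G      -- repeat until the SQL thread has applied the whole relay log
    START SLAVE IO_THREAD;   -- resume fetching once the slave has caught up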

Summary
With the slave IO_THREAD running, replication lag keeps growing and the slave does not catch up with the master within a day.
Without the slave IO_THREAD running, replication catches up with the master in less than 1 hour.
Something is wrong!?

The slave is a Xeon E5 with an SSD for data and SAS 10k disks for logs. The master is "only" a Xeon 56xx with SAS 15k + SAS 10k disks, everything in RAID 1. According to our deployment benchmark, the slave is (logically) faster than the master.
There was no task running on the slave that could slow down replication (RAID check, backup, ...).

Both run Percona-Server-server-55-5.5.27-rel28.1.296. The master is CentOS 6.x and the slave is an up-to-date CentOS 5.x.

Revision history for this message
MF (fuxa-kos) wrote :

During the lag, the process list showed:
slave
| 2 | system user | | | Connect | 10583 | Waiting for master to send event | | 0 | 0 | 1 |
| 3 | system user | | | Connect | 35127 | Reading event from the relay log | | 0 | 0 | 1 |

master
| 10174490 | repl_isp_pr | ...:43412 | | Binlog Dump | 10607 | Master has sent all binlog to slave; waiting for binlog to be updated |
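
For reference, the thread states above can be captured with the standard statements below (illustrative; run on the slave and the master respectively):

    -- On the slave: IO thread ("Waiting for master to send event") and
    -- SQL thread ("Reading event from the relay log").
    SHOW PROCESSLIST;
    SHOW SLAVE STATUS\G      -- Seconds_Behind_Master, Relay_Master_Log_File, ...

    -- On the master: the Binlog Dump thread serving this slave.
    SHOW PROCESSLIST;
    SHOW MASTER STATUS;      -- current binlog file and position for comparison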

Revision history for this message
Valerii Kravchuk (valerii-kravchuk) wrote :

This looks related to http://bugs.mysql.com/bug.php?id=53167 (see my last comment there), http://bugs.mysql.com/bug.php?id=56363 and http://bugs.mysql.com/bug.php?id=66868. It appears to happen at relay log rotation, so when no new relay logs appear there is no problem.
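
If the contention really is tied to relay log rotation, one possible mitigation (an assumption based on the comment above, not a confirmed fix) would be to make rotations less frequent by allowing larger relay logs:

    -- Assumption: less frequent relay log rotation narrows the window for the
    -- contention described above. With max_relay_log_size = 0 relay logs follow
    -- max_binlog_size; 1073741824 (1 GiB) is the documented maximum.
    SET GLOBAL max_relay_log_size = 1073741824;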

Revision history for this message
MF (fuxa-kos) wrote :

I confirm, there were many relay log rotations. I will look at the upstream (master branch) bugs.

Revision history for this message
Valerii Kravchuk (valerii-kravchuk) wrote :

Based on the status and last comment of the upstream report http://bugs.mysql.com/bug.php?id=66868, this is a confirmed bug in 5.1.x and 5.5.x, while in 5.6.x the problem is fixed upstream.

Revision history for this message
Erez Zarum (erezzarum-q) wrote :

It seems this is a critical bug; is there any estimate for a fix in 5.1.x?

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Erez -

Unfortunately, we currently don't have any estimate.

Revision history for this message
Erez Zarum (erezzarum-q) wrote :

After upgrading to 5.1.68, my slaves started to lag for no apparent reason, and yesterday the lag went up to over 1000 s (I have never had a lag of more than 5 s in the last 2 years, and there were no changes in the schema/application). This seems very critical; perhaps someone should mention it in the release notes.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Erez -

What version did you upgrade from? I am not sure that bugs.mysql.com/bug.php?id=66868 is a regression introduced in 5.1.68.

Revision history for this message
Erez Zarum (erezzarum-q) wrote :

Laurynas,
I have upgraded the master from 5.1.67 to 5.1.68, and the slaves from 5.1.66 to 5.1.68.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

Erez -

http://bugs.mysql.com/bug.php?id=66868 was not introduced between 5.1.66/67 and 5.1.68. Thus, since your replication started lagging after an upgrade from 5.1.66/67 to 5.1.68, I wouldn't assume that the current bug is to blame; I would troubleshoot the issue to determine the actual cause first.
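
A few first-pass checks along those lines (an illustrative starting point, not an exhaustive procedure) could be:

    -- Is the slave behind on fetching or on applying events?
    SHOW SLAVE STATUS\G      -- compare Read_Master_Log_Pos with Exec_Master_Log_Pos
    -- Is the SQL thread stuck on a single long-running statement?
    SHOW PROCESSLIST;
    -- Did the upgrade change any replication-related settings?
    SHOW GLOBAL VARIABLES LIKE '%relay_log%';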

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-2808
