pt-archiver --check-slave-lag aggressiveness
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Percona Toolkit (moved to https://jira.percona.com/projects/PT) | Fix Released | High | Frank Cizmich | 2.3.1
Bug Description
I'm running pt-archiver 2.1.3 to delete old data from a table. It's going rather slowly, and the statistics show ~95% of the time is being spent in the "other" action. Suspecting it might be too aggressive about checking slave lag, I added --check-interval 10 --max-lag 10, which gave me no improvement. Still suspicious, I ran tcpdump on the slave, and unless I'm misunderstanding something, it's being WAY too aggressive.
Command being used (tables/hosts obfuscated):
pt-archiver --purge --progress 50000 --statistics --bulk-delete --limit 10000 --commit-each --primary-key-only --check-slave-lag h=slave-host --check-interval 10 --max-lag 10 --source h=localhost,
While that was running on the master, I ran the following on the slave (using the port pt-archiver is connected to):
tcpdump -i bond0 port 49290 -s 65535 -x -n -q -tttt > tcpdump.txt
pt-query-digest tcpdump.txt --type tcpdump
...
# 2.1s user time, 80ms system time, 20.57M rss, 120.88M vsz
# Current date: Tue Sep 25 19:28:22 2012
# Hostname:
# Files: tcpdump.txt
# Overall: 1.16k total, 1 unique, 54.26 QPS, 0.01x concurrency ___________
# Time range: 2012-09-25 19:18:41.060553 to 19:19:02.511825
# Attribute total min max avg 95% stddev median
# ============ ======= ======= ======= ======= ======= ======= =======
# Exec time 113ms 28us 490us 96us 113us 15us 89us
# Rows affecte 0 0 0 0 0 0 0
# Query size 19.32k 17 17 17 17 0 17
# Warning coun 0 0 0 0 0 0 0
# Profile
# Rank Query ID Response time Calls R/Call Apdx V/M Item
# ==== ================== ============= ===== ====== ==== ===== ==========
# 1 0x791C5370A1021F19 0.1128 100.0% 1164 0.0001 1.00 0.00 SHOW SLAVE STATUS
# Query 1: 54.26 QPS, 0.01x concurrency, ID 0x791C5370A1021F19 at byte 3416555
# This item is included in the report because it matches --limit.
# Scores: Apdex = 1.00 [1.0], V/M = 0.00
# Query_time sparkline: | ^_ |
# Time range: 2012-09-25 19:18:41.060553 to 19:19:02.511825
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 100 1164
# Exec time 100 113ms 28us 490us 96us 113us 15us 89us
# Rows affecte 0 0 0 0 0 0 0 0
# Query size 100 19.32k 17 17 17 17 0 17
# Warning coun 0 0 0 0 0 0 0 0
# String:
# Errors none
# Hosts
# Query_time distribution
# 1us
# 10us #######
# 100us #######
# 1ms
# 10ms
# 100ms
# 1s
# 10s+
SHOW SLAVE STATUS\G
I don't think it should be running "SHOW SLAVE STATUS" 55 times per second (1,164 queries in the ~21.5-second capture above).
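For context, each of those 1,164 queries is a full round trip to the slave just to read Seconds_Behind_Master. Below is a minimal sketch of what such a lag check typically looks like in Perl/DBI; the DSN, credentials, and helper name are placeholders for illustration, not pt-archiver's actual code:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder connection details; in pt-archiver the real ones come
# from --check-slave-lag h=slave-host.
my $lag_dbh = DBI->connect('DBI:mysql:host=slave-host', 'user', 'password',
                           { RaiseError => 1 });

# One SHOW SLAVE STATUS round trip per call; Seconds_Behind_Master is
# NULL (undef in Perl) when replication is not running.
sub slave_lag {
    my ($dbh) = @_;
    my $status = $dbh->selectrow_hashref('SHOW SLAVE STATUS');
    return $status ? $status->{Seconds_Behind_Master} : undef;
}

Issuing a call like this once per archived row is what would produce the ~54 QPS seen in the digest.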
tags: added: pt-archiver replication-lag useless-work
Changed in percona-toolkit:
status: New → Triaged
Changed in percona-toolkit:
status: Triaged → In Progress
importance: Undecided → High
assignee: nobody → Frank Cizmich (frank-cizmich)
milestone: none → 2.3.1
Changed in percona-toolkit:
status: In Progress → Fix Committed
Changed in percona-toolkit:
status: Fix Committed → Fix Released
The slowness problem can be easily fixed by the following change:
# diff -u /usr/bin/pt-archiver ../pt-archiver
--- /usr/bin/pt-archiver	2015-01-23 10:30:15.000000000 +0000
+++ ../pt-archiver	2015-04-28 22:20:13.885862565 +0000
@@ -6075,6 +6075,8 @@
    my $first_row = [ @$row ];
    my $csv_row;
+   my $count;
+
    ROW:
    while (                 # Quit if:
       $row                 # There is no data
@@ -6304,13 +6306,17 @@
          PTDEBUG && _d('Got another row in this chunk');
       }
+      $count++;
+
       # Check slave lag and wait if slave is too far behind.
-      if ( $lag_dbh ) {
+      if ( $lag_dbh && $count % 100 == 0 ) {
This change checks the replication lag once every 100 rows instead of after every single row, which significantly improves efficiency.
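To show the shape of the fix without reading the full diff, here is a self-contained sketch of the modulo-throttled lag check under the same --max-lag/--check-interval semantics. It is an illustration under stated assumptions, not pt-archiver's internals: next_row() and archive_row() are hypothetical stand-ins for its fetch and purge logic, and the connection details are placeholders.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $check_every    = 100;   # rows between lag checks; the patch hard-codes 100
my $max_lag        = 10;    # --max-lag
my $check_interval = 10;    # --check-interval

# Placeholder connection; pt-archiver builds this from --check-slave-lag.
my $lag_dbh = DBI->connect('DBI:mysql:host=slave-host', 'user', 'password',
                           { RaiseError => 1 });

# Same SHOW SLAVE STATUS helper as sketched earlier.
sub slave_lag {
    my ($dbh) = @_;
    my $status = $dbh->selectrow_hashref('SHOW SLAVE STATUS');
    return $status ? $status->{Seconds_Behind_Master} : undef;
}

sub next_row    { ... }     # hypothetical: fetch the next row to archive
sub archive_row { ... }     # hypothetical: delete or copy the row

my $count = 0;
while ( my $row = next_row() ) {
    archive_row($row);
    next if ++$count % $check_every;    # skip the check on 99 rows out of 100
    my $lag = slave_lag($lag_dbh);
    # undef lag means replication is stopped; treat it as "too far behind".
    while ( !defined $lag || $lag > $max_lag ) {
        sleep $check_interval;
        $lag = slave_lag($lag_dbh);
    }
}

The trade-off is that up to 99 extra rows may be processed after the slave first exceeds --max-lag, which is usually acceptable given the per-row round trip it eliminates.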