pt-archiver --check-slave-lag aggressiveness

Bug #1056507 reported by Jonathan Nicol
This bug affects 3 people
Affects: Percona Toolkit (moved to https://jira.percona.com/projects/PT)
Status: Fix Released
Importance: High
Assigned to: Frank Cizmich
Milestone: 2.3.1

Bug Description

I'm running pt-archiver 2.1.3 to delete old data from a table. It's going rather slowly, and the statistics show ~95% of the time being spent on the "other" action. Suspecting it might be checking the slave lag too aggressively, I added --check-interval 10 --max-lag 10, which didn't give me any improvement. Still suspicious, I ran tcpdump on the slave, and unless I'm misunderstanding something, it is being WAY too aggressive.

Command being used (tables/hosts obfuscated):
pt-archiver --purge --progress 50000 --statistics --bulk-delete --limit 10000 --commit-each --primary-key-only --check-slave-lag h=slave-host --check-interval 10 --max-lag 10 --source h=localhost,D=database,t=table --where "created_date < date_sub(now(), INTERVAL 6 month)"

While that was running on the master, I ran the following on the slave (capturing the port pt-archiver is connected to):

tcpdump -i bond0 port 49290 -s 65535 -x -n -q -tttt > tcpdump.txt
pt-query-digest tcpdump.txt --type tcpdump
...
# 2.1s user time, 80ms system time, 20.57M rss, 120.88M vsz
# Current date: Tue Sep 25 19:28:22 2012
# Hostname:
# Files: tcpdump.txt
# Overall: 1.16k total, 1 unique, 54.26 QPS, 0.01x concurrency ___________
# Time range: 2012-09-25 19:18:41.060553 to 19:19:02.511825
# Attribute total min max avg 95% stddev median
# ============ ======= ======= ======= ======= ======= ======= =======
# Exec time 113ms 28us 490us 96us 113us 15us 89us
# Rows affecte 0 0 0 0 0 0 0
# Query size 19.32k 17 17 17 17 0 17
# Warning coun 0 0 0 0 0 0 0

# Profile
# Rank Query ID Response time Calls R/Call Apdx V/M Item
# ==== ================== ============= ===== ====== ==== ===== ==========
# 1 0x791C5370A1021F19 0.1128 100.0% 1164 0.0001 1.00 0.00 SHOW SLAVE STATUS

# Query 1: 54.26 QPS, 0.01x concurrency, ID 0x791C5370A1021F19 at byte 3416555
# This item is included in the report because it matches --limit.
# Scores: Apdex = 1.00 [1.0], V/M = 0.00
# Query_time sparkline: | ^_ |
# Time range: 2012-09-25 19:18:41.060553 to 19:19:02.511825
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 100 1164
# Exec time 100 113ms 28us 490us 96us 113us 15us 89us
# Rows affecte 0 0 0 0 0 0 0 0
# Query size 100 19.32k 17 17 17 17 0 17
# Warning coun 0 0 0 0 0 0 0 0
# String:
# Errors none
# Hosts
# Query_time distribution
# 1us
# 10us ################################################################
# 100us ####################################
# 1ms
# 10ms
# 100ms
# 1s
# 10s+
SHOW SLAVE STATUS\G

I don't think it should be running "SHOW SLAVE STATUS" 55 times per second.
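A quick back-of-envelope from the digest output above bears that out (my arithmetic, using the capture window from the report):

perl -e 'printf "%.2f checks/sec\n", 1164 / 21.451'   # 1,164 calls over the 21.45 s capture = the 54.26 QPS above

That rate is consistent with one SHOW SLAVE STATUS per row processed, rather than one check per --check-interval.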

tags: added: pt-archiver replication-lag useless-work
Changed in percona-toolkit:
status: New → Triaged
Changed in percona-toolkit:
status: Triaged → In Progress
importance: Undecided → High
assignee: nobody → Frank Cizmich (frank-cizmich)
milestone: none → 2.3.1
Michael Wang (xw73) wrote:

The slowness problem can be easily fixed by the following change:

# diff -u /usr/bin/pt-archiver ../pt-archiver
--- /usr/bin/pt-archiver 2015-01-23 10:30:15.000000000 +0000
+++ ../pt-archiver 2015-04-28 22:20:13.885862565 +0000
@@ -6075,6 +6075,8 @@
    my $first_row = [ @$row ];
    my $csv_row;

+   my $count;
+
    ROW:
    while ( # Quit if:
       $row # There is no data
@@ -6304,13 +6306,17 @@
          PTDEBUG && _d('Got another row in this chunk');
       }

+      $count++;
+
       # Check slave lag and wait if slave is too far behind.
-      if ( $lag_dbh ) {
+      if ( $lag_dbh && $count % 100 == 0 ) {

This checks the replication lag once every 100 rows instead of on every row, which significantly improves efficiency.
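For readers following along, a minimal self-contained sketch of the same throttling pattern (the names here are illustrative, not pt-archiver's actual internals):

#!/usr/bin/env perl
use strict;
use warnings;

# Amortize an expensive per-row check by running it only every Nth row,
# using a modulo counter -- the same technique as the patch above.
my $check_every = 100;
my $count       = 0;

# Stand-in for pt-archiver's real check, which runs SHOW SLAVE STATUS on
# the --check-slave-lag host and pauses while the slave is too far behind.
sub check_slave_lag { print "lag check at row $count\n" }

for my $row ( 1 .. 1_000 ) {
    # ... archive/delete the row here ...
    $count++;
    check_slave_lag() if $count % $check_every == 0;
}

The trade-off is that the slave can drift for up to 99 rows before the tool notices, which is negligible next to issuing SHOW SLAVE STATUS once per row.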

Changed in percona-toolkit:
status: In Progress → Fix Committed
Frank Cizmich (frank-cizmich) wrote:

Fixed by checking the lag every 100 rows, as Michael suggested :-)

Changed in percona-toolkit:
status: Fix Committed → Fix Released
Shahriyar Rzayev (rzayev-sehriyar) wrote:

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PT-332
