pt-archiver --check-slave-lag aggressiveness
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Percona Toolkit (moved to https://jira.percona.com/projects/PT) | Fix Released | High | Frank Cizmich | 2.3.1
Bug Description
I'm running pt-archiver 2.1.3 to delete old data from a table. It's going rather slowly, and the statistics show ~95% of the time is being spent in the "other" action. Suspecting it might be too aggressive about checking slave lag, I added --check-interval 10 --max-lag 10, which gave me no improvement. Still suspicious, I ran tcpdump on the slave, and unless I'm misunderstanding something, it's being WAY too aggressive.
Command being used (tables/hosts obfuscated):
pt-archiver --purge --progress 50000 --statistics --bulk-delete --limit 10000 --commit-each --primary-key-only --check-slave-lag h=slave-host --check-interval 10 --max-lag 10 --source h=localhost,
While that was running on the master, I ran the following on the slave (using the port pt-archiver is connected to):
tcpdump -i bond0 port 49290 -s 65535 -x -n -q -tttt > tcpdump.txt
pt-query-digest tcpdump.txt --type tcpdump
...
# 2.1s user time, 80ms system time, 20.57M rss, 120.88M vsz
# Current date: Tue Sep 25 19:28:22 2012
# Hostname:
# Files: tcpdump.txt
# Overall: 1.16k total, 1 unique, 54.26 QPS, 0.01x concurrency ___________
# Time range: 2012-09-25 19:18:41.060553 to 19:19:02.511825
# Attribute total min max avg 95% stddev median
# ============ ======= ======= ======= ======= ======= ======= =======
# Exec time 113ms 28us 490us 96us 113us 15us 89us
# Rows affecte 0 0 0 0 0 0 0
# Query size 19.32k 17 17 17 17 0 17
# Warning coun 0 0 0 0 0 0 0
# Profile
# Rank Query ID Response time Calls R/Call Apdx V/M Item
# ==== ================== ============= ===== ====== ==== ===== ==========
# 1 0x791C5370A1021F19 0.1128 100.0% 1164 0.0001 1.00 0.00 SHOW SLAVE STATUS
# Query 1: 54.26 QPS, 0.01x concurrency, ID 0x791C5370A1021F19 at byte 3416555
# This item is included in the report because it matches --limit.
# Scores: Apdex = 1.00 [1.0], V/M = 0.00
# Query_time sparkline: | ^_ |
# Time range: 2012-09-25 19:18:41.060553 to 19:19:02.511825
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 100 1164
# Exec time 100 113ms 28us 490us 96us 113us 15us 89us
# Rows affecte 0 0 0 0 0 0 0 0
# Query size 100 19.32k 17 17 17 17 0 17
# Warning coun 0 0 0 0 0 0 0 0
# String:
# Errors none
# Hosts
# Query_time distribution
# 1us
# 10us #######
# 100us #######
# 1ms
# 10ms
# 100ms
# 1s
# 10s+
SHOW SLAVE STATUS\G
I don't think it should be running "SHOW SLAVE STATUS" 55 times per second (1,164 queries in the ~21.5-second capture above).
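For context, each of those 1,164 queries is a full round trip to the slave just to read Seconds_Behind_Master. Below is a minimal sketch of what such a lag check typically looks like in Perl/DBI; the DSN, credentials, and helper name are placeholders for illustration, not pt-archiver's actual code:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholder connection details; in pt-archiver the real ones come
# from --check-slave-lag h=slave-host.
my $lag_dbh = DBI->connect('DBI:mysql:host=slave-host', 'user', 'password',
                           { RaiseError => 1 });

# One SHOW SLAVE STATUS round trip per call; Seconds_Behind_Master is
# NULL (undef in Perl) when replication is not running.
sub slave_lag {
    my ($dbh) = @_;
    my $status = $dbh->selectrow_hashref('SHOW SLAVE STATUS');
    return $status ? $status->{Seconds_Behind_Master} : undef;
}

Issuing a call like this once per archived row is what would produce the ~54 QPS seen in the digest.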
tags: added: pt-archiver replication-lag useless-work
Changed in percona-toolkit:
status: New → Triaged
Changed in percona-toolkit:
status: Triaged → In Progress
importance: Undecided → High
assignee: nobody → Frank Cizmich (frank-cizmich)
milestone: none → 2.3.1
Changed in percona-toolkit:
status: In Progress → Fix Committed
Changed in percona-toolkit:
status: Fix Committed → Fix Released
The slowness problem can be easily fixed by the following change:
# diff -u /usr/bin/pt-archiver ../pt-archiver
--- /usr/bin/pt-archiver	2015-01-23 10:30:15.000000000 +0000
+++ ../pt-archiver	2015-04-28 22:20:13.885862565 +0000
@@ -6075,6 +6075,8 @@
    my $first_row = [ @$row ];
    my $csv_row;
+   my $count;
+
    ROW:
    while (                 # Quit if:
       $row                 # There is no data
@@ -6304,13 +6306,17 @@
          PTDEBUG && _d('Got another row in this chunk');
       }
+      $count++;
+
       # Check slave lag and wait if slave is too far behind.
-      if ( $lag_dbh ) {
+      if ( $lag_dbh && $count % 100 == 0 ) {
This change checks the replication lag once every 100 rows instead of after every single row, which significantly improves efficiency.
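To show the shape of the fix without reading the full diff, here is a self-contained sketch of the modulo-throttled lag check under the same --max-lag/--check-interval semantics. It is an illustration under stated assumptions, not pt-archiver's internals: next_row() and archive_row() are hypothetical stand-ins for its fetch and purge logic, and the connection details are placeholders.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $check_every    = 100;   # rows between lag checks; the patch hard-codes 100
my $max_lag        = 10;    # --max-lag
my $check_interval = 10;    # --check-interval

# Placeholder connection; pt-archiver builds this from --check-slave-lag.
my $lag_dbh = DBI->connect('DBI:mysql:host=slave-host', 'user', 'password',
                           { RaiseError => 1 });

# Same SHOW SLAVE STATUS helper as sketched earlier.
sub slave_lag {
    my ($dbh) = @_;
    my $status = $dbh->selectrow_hashref('SHOW SLAVE STATUS');
    return $status ? $status->{Seconds_Behind_Master} : undef;
}

sub next_row    { ... }     # hypothetical: fetch the next row to archive
sub archive_row { ... }     # hypothetical: delete or copy the row

my $count = 0;
while ( my $row = next_row() ) {
    archive_row($row);
    next if ++$count % $check_every;    # skip the check on 99 rows out of 100
    my $lag = slave_lag($lag_dbh);
    # undef lag means replication is stopped; treat it as "too far behind".
    while ( !defined $lag || $lag > $max_lag ) {
        sleep $check_interval;
        $lag = slave_lag($lag_dbh);
    }
}

The trade-off is that up to 99 extra rows may be processed after the slave first exceeds --max-lag, which is usually acceptable given the per-row round trip it eliminates.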