stalls after roughly 900K queries

Bug #1269557 reported by Chris Hoffman
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MOVED - Percona Playback
New
Undecided
Unassigned

Bug Description

I have been trying to playback three days of our production logs on our new host. The process is currently stalled after executing 880,500 queries (teeing the output to a file and grepped for microseconds). Here is what the process looks like from `ps aux`:

  ncadmin 10648 98.8 0.0 7929536 65804 pts/10 Sl+ Jan14 1011:34 percona-playback

strace reports the following (attached to existing pid):

  futex(0x95d8dc, FUTEX_WAIT_PRIVATE, 1, NULL

I'm not sure what other information will be helpful.

Tags: i45771
Revision history for this message
Ian Bissett (bisscuitt-e) wrote :

I'm currently experiencing this same bug:

Version: 0.7

Command ran:
percona-playback --query-log-file /mnt/logs/mysql/mysql-slow-db2a_dal5sl.log --thread-pool-threads-count=125 --dispatcher-plugin thread-pool

Runs for several thousand queries then stalls, after a few minutes will restart before stalling again.

Strace on the parent process during stall:
[pid 27374] futex(0x7f2a87fd6de0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27373] futex(0x7f2a887d7de0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27372] futex(0x7f2a88fd8de0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27371] futex(0x7f2a897d9de0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27370] futex(0x7f2a89fdade0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27369] futex(0x7f2a8a7dbde0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27368] futex(0x7f2a8afdcde0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27367] futex(0x7f2a8b7ddde0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27366] futex(0x7f2a8bfdede0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27365] futex(0x7f2a8c7dfde0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27364] futex(0x7f2a8cfe0de0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27363] futex(0x7f2a8d7e1de0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27362] futex(0x7f2a8dfe2de0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27361] futex(0x7f2a8e7e3de0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27360] futex(0x7f2a8efe4de0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27359] futex(0x7f2a8f7e5de0, FUTEX_WAIT, 2, NULL <unfinished ...>
[pid 27358] futex(0x7f2a8ffe6de0, FUTEX_WAIT, 2, NULL <unfinished ...>

Revision history for this message
Ian Bissett (bisscuitt-e) wrote :

Note: During this time the connections to MySQL (Percona Server 5.5.39-36.0) are still open and completely idle.

A strace of mysql show no activity on those connections.

Revision history for this message
Ian Bissett (bisscuitt-e) wrote :

Also seeing the following during pauses (Parent PID: 24982)
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0
[pid 30146] sched_yield() = 0

Revision history for this message
Ian Bissett (bisscuitt-e) wrote :

Further investigation appears to show that all the percona-playback threads are being locked by a single process in the mysql process tree.

Killing the mysql thread revives percona-playback. This only seems applicable to the `sched_yield() = 0` issue.

tags: added: i45771
Revision history for this message
Peiran Song (peiran-song) wrote :

I didn’t reproduce the stall with a simple test case. Test case as below:

percona-playback: Version: 0.7
MySQL: community 5.6.19

1) set long_query_time to 0 to write all queries to slow log

2) Use sysbench to prepare 64 tables, each with 5K records
# sysbench --test=/usr/share/doc/sysbench/tests/db/multi_table_rw.lua --oltp-table-size=5000 --mysql-table-engine=innodb --tables=64 --mysql-user=root --mysql-password=secret --mysql-socket=/var/lib/mysql/mysql.sock prepare

3) Run 100K queries with 125 threads
# sysbench --test=/usr/share/doc/sysbench/tests/db/multi_table_rw.lua --oltp-table-size=5000 --mysql-table-engine=innodb --tables=64 --mysql-user=root --mysql-password=secret --mysql-socket=/var/lib/mysql/mysql.sock --num-threads=125 --max-requests=100000 run

4) Cleared sysbench tables and ran percona-playback
# percona-playback --query-log-file /var/lib/mysql/slow.log.saved --thread-pool-threads-count=125 --dispatcher-plugin thread-pool --mysql-username=root --mysql-password=secret --mysql-schema=sbtest > playback.out 2>playback.err

No stall was observed, ~200K queries were executed.

I understand this is much simpler comparing to a production workload. How many databases and tables are involved in your case? How many concurrent threads(Threads_running do you normally have?

Can you try with less threads, like --thread-pool-threads-count=60 ?

Revision history for this message
Ian Bissett (bisscuitt-e) wrote :

Hi Perian,

I think this is actually related to a problem in MySQL and not percona-playback. It seems that some queries are stalling in MySQL and percona-playback is backing off from running more queries until the stalled queries are complete.

I believe that this is the correct behaviour for percona-playback.

If I run with less threads the issue does still appear. I am testing with 9193 databases (1478097 tables) using a total of 734GB.

Report from percona-playback:

Report
------
Executed 6389650 queries
Spent 03:50:46.761655 executing queries versus an expected 00:38:35.135917 time.
1054099 queries were quicker than expected, 5335550 were slower
A total of 244 queries had errors.
Expected 12548225 rows, got 12563629 (a difference of 15404)
Number of queries where number of rows differed: 8281.

Average of 16.47 queries per connection (387840 connections).

Revision history for this message
Peiran Song (peiran-song) wrote :

Hi Ian,

Thank you for the update. It is good to know that it no longer appears to be a percona-playback bug.

I sent you a reply via the support issue as well.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.