performance tuning for deadlock detect switch

Bug #952920 reported by Hui Liu
Affects: Percona Server (moved to https://jira.percona.com/projects/PS)
Status tracked in 5.7

Series  Status        Importance  Assigned to
5.1     Won't Fix     Wishlist    Unassigned
5.5     Triaged       Wishlist    Unassigned
5.6     Triaged       Wishlist    Unassigned
5.7     Fix Released  Wishlist    Unassigned

Bug Description

Regarding the deadlock detection mechanism in InnoDB, it has long been debated
whether recursive deadlock checking is worthwhile in some special scenarios,
such as many concurrent updates to the same record.

A Planet MySQL post recommends:
“InnoDB is much faster when deadlock detection is disabled for workloads with
a lot of concurrency and contention.”
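A minimal sketch (an illustrative Python model, not InnoDB's actual code) of why recursive deadlock detection gets expensive under hot-record contention: every blocked transaction triggers a search of the wait-for graph, and when N transactions queue on the same record, each new waiter can end up examining edges against all earlier waiters, so total work grows roughly quadratically.

```python
def has_deadlock(waits_for, start):
    """Depth-first search of the wait-for graph; True if `start` can reach itself.

    waits_for maps a transaction id to the transactions it is blocked by.
    """
    stack, seen = [start], set()
    while stack:
        txn = stack.pop()
        for blocker in waits_for.get(txn, ()):
            if blocker == start:
                return True          # cycle back to the start: deadlock
            if blocker not in seen:
                seen.add(blocker)
                stack.append(blocker)
    return False

# Hot-record pile-up: transaction t conflicts with every earlier waiter.
# No cycle exists, yet each check walks a queue that keeps growing.
hot_queue = {t: list(range(t)) for t in range(1, 1000)}
assert not has_deadlock(hot_queue, 999)

# A genuine cycle (0 waits for 1, 1 waits for 0) is detected.
assert has_deadlock({0: [1], 1: [0]}, 0)
```

With detection disabled, real cycles are instead broken by innodb_lock_wait_timeout, which is why such a switch is only safe for workloads that rarely deadlock.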

We are suffering from exactly this scenario in one of Taobao's core applications, Item Center (IC).
Most of the time performance is fine, but during special sales promotions (roughly once per month),
when huge numbers of Taobao users participate, it degrades very badly.

Here is the oprofile result (simulating the online scenario):

Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        symbol name
2008672  84.8036  lock_deadlock_recursive
91364     3.8573  lock_has_to_wait
11216     0.4735  safe_mutex_lock
9719      0.4103  ut_delay
8047      0.3397  MYSQLparse(void*)
7938      0.3351  lock_rec_has_to_wait_in_queue
7788      0.3288  code_state
7601      0.3209  my_strnncoll_binary
6703      0.2830  dict_col_get_clust_pos_noninline
6598      0.2786  _db_enter_
6451      0.2724  _db_return_
5733      0.2420  _db_doprnt_
5503      0.2323  rec_get_offsets_func
5325      0.2248  ha_innobase::update_row(unsigned char const*, unsigned char*)
5241      0.2213  mutex_spin_wait
4931      0.2082  build_template(row_prebuilt_struct*, THD*, st_table*, unsigned int)
4655      0.1965  lock_rec_convert_impl_to_expl

As you can see, lock_deadlock_recursive dominates the profile. So we added a switch
to disable deadlock detection dynamically. The IC application sees almost no
deadlocks because its business SQL logic has been tuned, so this appears to carry no risk for us.

To make the scenario reproducible, a test case built from the data we hit (with sensitive columns tweaked) and the related patch are provided below. Please help review them.

Revision history for this message
Hui Liu (hickey) wrote :

Steps to run:

1. unzip the deadlock.7zip
2. start the test with "sh run.sh" and observe the perf data
3. apply the patch, turn the switch off, and collect the results again:

root@(none) 03:40:21>set global innodb_deadlock_detect=off;
Query OK, 0 rows affected (0.00 sec)

root@(none) 03:40:33>show variables like '%detect%';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| innodb_deadlock_detect | OFF   |
+------------------------+-------+
1 row in set (0.00 sec)

The test results are summarized as:
1,000,000 queries, concurrency=16:
2124 vs 1971 seconds
1,000,000 queries, concurrency=700:
33569 vs 2612 seconds

As we can see, the patched switch greatly improves performance in the high-concurrency scenario.
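For reference, the arithmetic on the reported figures (assuming each pair is "detection on" vs "detection off" for the same 1,000,000-query run; the report itself only says "vs"):

```python
# Reported wall-clock times in seconds; on/off labeling is our assumption.
on_c16, off_c16 = 2124, 1971
on_c700, off_c700 = 33569, 2612

assert round(on_c16 / off_c16, 2) == 1.08    # ~8% faster at concurrency 16
assert round(on_c700 / off_c700, 1) == 12.9  # ~13x faster at concurrency 700
```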

yinfeng (yinfeng-zwx)
Changed in percona-server:
assignee: nobody → yinfeng (yinfeng-zwx)
assignee: yinfeng (yinfeng-zwx) → nobody
Revision history for this message
Hui Liu (hickey) wrote :

In later testing we found another interesting issue (not using the separate purge thread feature):
when the history list length to be purged grows larger and larger (beyond 300K), which happens about 30 minutes into the test scenario, one purge loop (run at a 10-second interval) stretches from seconds to several minutes, and the server's TPS drops sharply during that period.

Therefore, to keep the purge stage under control in such cases, we need to bound the number of pages purged per pass for stable performance, so yet another dynamic variable is introduced: innodb_max_purge_size. Once a large purge begins, set innodb_max_purge_size to some value, such as 500, as the threshold of pages purged in one purge pass. Reset it to 0 (no purge throttling) once the history list length drops back to a normal value. Test results show it works as we expected: stable performance throughout the large purge stage.
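A minimal sketch (hypothetical function name; our reading of the proposal, not the actual patch code) of the throttling rule the new variable implements: cap the number of pages one purge pass may process, with 0 meaning unthrottled.

```python
def pages_to_purge(history_len, max_purge_size):
    """How many pages one purge pass should process.

    max_purge_size == 0 disables throttling (purge everything pending),
    matching the proposed default of innodb_max_purge_size = 0.
    """
    if max_purge_size == 0:
        return history_len
    return min(history_len, max_purge_size)

assert pages_to_purge(300_000, 0) == 300_000  # unthrottled: drain the backlog
assert pages_to_purge(300_000, 500) == 500    # throttled: steady 500-page slices
assert pages_to_purge(200, 500) == 200        # small backlog: purge it all
```

The cap trades a longer total purge time for steady TPS, since each 10-second purge pass stays short.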

The new patch is attached, with this new variable and related test cases/results.

Stewart Smith (stewart)
Changed in percona-server:
importance: Undecided → Medium
importance: Medium → Wishlist
tags: added: contribution
Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

I got the following conflict when applying the patch:

patch -Np1 < /tmp/deadlock_detect_switch.diff
patching file storage/innobase/handler/ha_innodb.cc
Hunk #1 succeeded at 111 (offset 1 line).
Hunk #2 succeeded at 12387 (offset 252 lines).
Hunk #3 succeeded at 12914 (offset 298 lines).
Hunk #4 succeeded at 12945 (offset 301 lines).
patching file storage/innobase/lock/lock0lock.c
Hunk #1 succeeded at 377 (offset 1 line).
Hunk #2 succeeded at 1812 (offset 2 lines).
Hunk #3 succeeded at 3822 (offset 3 lines).
patching file mysql-test/suite/sys_vars/t/deadlock_detect_basic.test
patching file mysql-test/suite/sys_vars/t/max_purge_size_basic.test
patching file storage/innobase/srv/srv0srv.c
Hunk #1 succeeded at 548 (offset 3 lines).
Hunk #2 succeeded at 3966 (offset 70 lines).
patching file mysql-test/r/percona_server_variables_release.result
Hunk #1 succeeded at 97 (offset 1 line).
Hunk #2 FAILED at 129.
1 out of 2 hunks FAILED -- saving rejects to file mysql-test/r/percona_server_variables_release.result.rej
patching file mysql-test/suite/sys_vars/r/all_vars.result
Hunk #1 succeeded at 14 (offset -1 lines).

Now, that reject file is:

--- mysql-test/r/percona_server_variables_release.result
+++ mysql-test/r/percona_server_variables_release.result
@@ -129,6 +130,7 @@
 INNODB_LOG_GROUP_HOME_DIR
 INNODB_MAX_DIRTY_PAGES_PCT
 INNODB_MAX_PURGE_LAG
+INNODB_MAX_PURGE_SIZE
 INNODB_MIRRORED_LOG_GROUPS
 INNODB_OLD_BLOCKS_PCT
 INNODB_OLD_BLOCKS_TIME

That brings me to the question:

Why do you need INNODB_MAX_PURGE_SIZE when the same effect can be accomplished with innodb-purge-batch-size? That batch size determines the number of records purged by trx_purge. Can you elaborate on this?

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

The percona_server_variables_release conflict is trivial and needs a simple re-record. It does not affect the core of the patch.

Changed in percona-server:
status: New → Incomplete
Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Laurynas, ah, yes, the conflict is trivial, the purpose of mentioning it was for the question regarding innodb-purge-batch-size (and to imply that INNODB_MAX_PURGE_SIZE may not be required).

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Percona Server because there has been no activity for 60 days.]

Changed in percona-server:
status: Incomplete → Expired
Changed in percona-server:
status: Expired → New
tags: added: xtradb
Revision history for this message
Hui Liu (hickey) wrote :

Further performance tuning for high concurrency on hot records is discussed in this Google Groups thread: https://groups.google.com/forum/?fromgroups#!topic/percona-discussion/z3r4-Qm0oYg

tags: added: performance
Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

MySQL 5.7.15 has implemented innodb_deadlock_detect, so closing as Fix Released.

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-2370
