Semi-sync replication performance degrades with high number of threads

Bug #1413836 reported by Jericho Rivera on 2015-01-23
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MySQL Server
Unknown
Unknown
Percona Server moved to https://jira.percona.com/projects/PS
Status tracked in 5.7
5.5
Invalid
Undecided
Unassigned
5.6
Fix Released
Medium
Unassigned
5.7
Triaged
Medium
Unassigned

Bug Description

* Taken verbatim from mysql bug report filed by Rene Cannao

Description:

Based on our experience (issues in productions and easily reproducible in testing environment), when semi-sync replication is enabled mysqld is able to handle around 6k TRX/s when the number of threads running is relatively low (100-200 connections) but degrades when the number of threads running go beyond a certain threshold. With 3000 threads running, throughput is no more than 300 TRX/s .

While at low number of threads running the bottleneck seems to be network rtt , at high number of threads network activity drops, and we also notice a constantly high number of context switches.

Trying to combine the output of pt-pmp (I think the relevant part is what is listed below) and semi-sync source code, we believe there is a high contention on LOCK_binlog_ in order to compare binlog coordinates of ACK from slave(s) and the binlog coordinates of each thread that issued a commit, and what seems an inefficient way to wake up threads.

   2459 pthread_cond_timedwait,ReplSemiSyncMaster::cond_timewait(semisync_master.so),ReplSemiSyncMaster::commitTrx(semisync_master.so),Trans_delegate::after_commit,MYSQL_BIN_LOG::process_after_commit_stage_queue(binlog.cc:6833),MYSQL_BIN_LOG::ordered_commit(binlog.cc:7217),MYSQL_BIN_LOG::commit(binlog.cc:6609),ha_commit_trans(handler.cc:1500),trans_commit,mysql_execute_command(sql_parse.cc:4711),mysql_parse(sql_parse.cc:6744),dispatch_command(sql_parse.cc:1432),do_handle_one_connection,handle_one_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6),??
    541 pthread_cond_wait,Stage_manager::enroll_for(mysql_thread.h:1162),MYSQL_BIN_LOG::ordered_commit(binlog.cc:6897),MYSQL_BIN_LOG::commit(binlog.cc:6609),ha_commit_trans(handler.cc:1500),trans_commit,mysql_execute_command(sql_parse.cc:4711),mysql_parse(sql_parse.cc:6744),dispatch_command(sql_parse.cc:1432),do_handle_one_connection,handle_one_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6),??
...
      1 __lll_lock_wait(libpthread.so.0),_L_lock_1006(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),ReplSemiSyncMaster::lock(semisync_master.so),ReplSemiSyncMaster::updateSyncHeader(semisync_master.so),Binlog_transmit_delegate::before_send_event,mysql_binlog_send(rpl_master.cc:1325),com_binlog_dump(rpl_master.cc:767),dispatch_command(sql_parse.cc:1644),do_handle_one_connection,handle_one_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6),??

Each transaction thread waiting an ACK does the follow (simplified):
- hold a lock;
- in a loop ( while (is_on()) ) :
  - compare the ACK position with its own position;
  - it the ACK is still behind, it waits on a condition variable

In reportReplyBinlog , if there is at least one thread that is waiting an ACK , it sends a broadcast to all the threads.

This also create a contention on LOCK_log in MYSQL_BIN_LOG::ordered_commit .

What seems to be a design flaw is that all the threads wake up, and while it is likely that they will all return to the application in a scenarios with few threads running, with a lot of threads running perhaps only few will return to the application and most of them will go back to wait on the same condition variable. This creates a lot of context switch and CPU get too busy in performing such operations that is not be able to perform a lot of progress with replication.

Attached is the output of pt-pmp resulting from the follow command:
pt-pmp --iterations=1 --save-samples=/tmp/samples.pmp > /tmp/output.pmp
When this was executed, mysqld was processing the traffic generated by sysbench in a write intensive workload running 3000 threads.

How to repeat:
Setup semi-sync replication.

Run a write intensive workload against the master with few threads. Ex:
sysbench --max-requests=0 --max-time=7200 --test=oltp --mysql-user=rcannao --mysql-password=rcannao --mysql-db=rcannao --db-driver=mysql --db-ps-mode=disable --oltp-table-size=10000000 --oltp-point-selects=1 --oltp-index-updates=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --num-threads=30 run

Run a write intensive workload against the master with a lot of threads. Ex:
sysbench --max-requests=0 --max-time=7200 --test=oltp --mysql-user=rcannao --mysql-password=rcannao --mysql-db=rcannao --db-driver=mysql --db-ps-mode=disable --oltp-table-size=10000000 --oltp-point-selects=1 --oltp-index-updates=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --num-threads=3000 run

As a comparison, run the above workload without semi-sync enabled. Run this to verify that the bottleneck is semi-sync.

Suggested fix:
Few spin loops before suspending the thread with pthread_cond_timedwait()

tags: added: upstream
Download full text (5.0 KiB)

Verified with PS 5.6.

---------- With --num-threads=30 and normal replication

nilnandan@desktop:~$ sysbench --max-requests=0 --max-time=900 --test=/usr/share/doc/sysbench/tests/db/oltp.lua --mysql-user=root --mysql-password=msandbox --mysql-socket=/tmp/mysql_sandbox21087.sock --mysql-db=dbtest --db-driver=mysql --db-ps-mode=disable --oltp-table-size=10000000 --oltp-point-selects=1 --oltp-index-updates=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --num-threads=30 run
sysbench 0.5: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 30
Random number generator seed is 0 and will be ignored
Threads started!

OLTP test statistics:
..
    transactions: 1980700 (2200.77 per sec.)
    read/write requests: 7922800 (8803.08 per sec.)
    other operations: 3961400 (4401.54 per sec.)
...
nilnandan@desktop:~$

---------- With --num-threads=30 and semi-sync replication

nilnandan@desktop:~$ nilnandan@desktop:~$ sysbench --max-requests=0 --max-time=900 --test=/usr/share/doc/sysbench/tests/db/oltp.lua --mysql-user=root --mysql-password=msandbox --mysql-socket=/tmp/mysql_sandbox21087.sock --mysql-db=dbtest --db-driver=mysql --db-ps-mode=disable --oltp-table-size=10000000 --oltp-point-selects=1 --oltp-index-updates=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --num-threads=30 run
sysbench 0.5: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 30
Random number generator seed is 0 and will be ignored
Threads started!

OLTP test statistics:
   ...
    transactions: 1463881 (1626.51 per sec.)
    read/write requests: 5855524 (6506.03 per sec.)
    other operations: 2927762 (3253.02 per sec.)
...
nilnandan@desktop:~$

---------- With --num-threads=300 and normal replication

nilnandan@desktop:~$ sysbench --max-requests=0 --max-time=900 --test=/usr/share/doc/sysbench/tests/db/oltp.lua --mysql-user=root --mysql-password=msandbox --mysql-socket=/tmp/mysql_sandbox21087.sock --mysql-db=dbtest --db-driver=mysql --db-ps-mode=disable --oltp-table-size=10000000 --oltp-point-selects=1 --oltp-index-updates=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --num-threads=300 run
sysbench 0.5: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 300
Random number generator seed is 0 and will be ignored
Threads started!

OLTP test statistics:
...
     transactions: 2425613 (2694.89 per sec.)
    read/write requests: 9702452 (10779.54 per sec.)
    other operations: 4851226 (5389.77 per sec.)
..

---------- With --num-threads=300 and semi-sync replication

nilnandan@desktop:~$ sysbench --max-requests=0 --max-time=900 --test=/usr/share/doc/sysbench/tests/db/oltp.lua --mysql-user=root --mysql-password=msandbox --mysql-socket=/tmp/mysql_sandbox21087.sock --mysql-db=dbtest --db-driver=mysql --db-ps-mode=disable --oltp-...

Read more...

Sveta Smirnova (svetasmirnova) wrote :

Rene found that Percona's builds don't use option -DBUILD_CONFIG=mysql_release which, in its turn, ensures that libaio and fast mutexes used on Linux. Also Percona's build does not add these options in any other way. I think we should not wait when Oracle fixes bug #75570 and build with these options. It won't fix the bug completely, but will make Percona's server not slower than Oracle's.

Shall we have a separate bug report for -DBUILD_CONFIG=mysql_release issue then?

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-1603

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Related blueprints

Remote bug watches

Bug watches keep track of this bug in other bug trackers.