Slave hangs when using MASTER_DELAY and parallel replication

Bug #1701554 reported by Vojtech Kurka
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona Server moved to https://jira.percona.com/projects/PS
Expired
Undecided
Unassigned

Bug Description

I have set
CHANGE MASTER TO MASTER_DELAY=1800;

slave_parallel_type=DATABASE
slave_parallel_workers=16
relay_log_space_limit=2G

Just after a relay log rotation, the slave hangs with this message "Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event".

$ mysql -e 'show slave status \G'
*************************** 1. row ***************************
               Slave_IO_State: Waiting for the slave SQL thread to free enough relay log space
                  Master_Host: ...
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: klarka4-bin.004857
          Read_Master_Log_Pos: 1070750282
               Relay_Log_File: klarka3-relay-bin.000019
                Relay_Log_Pos: 946873
        Relay_Master_Log_File: klarka4-bin.004856
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table: temp.%,mysql.%
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 946736
              Relay_Log_Space: 2147484820
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 6512
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 4
                  Master_UUID: a9999f14-0061-11e7-ae3f-0cc47ae26f1a
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 1800
          SQL_Remaining_Delay: 849
      Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: a9999f14-0061-11e7-ae3f-0cc47ae26f1a:1043480786-1044861551
            Executed_Gtid_Set: 0afbf8bc-46de-11e7-9dec-0cc47a6998fa:1-8,
1fd82bbe-ae2e-11e4-b7fa-002590c8b018:1-2093873,
a9999f14-0061-11e7-ae3f-0cc47ae26f1a:1-1044289525,
df922a0a-e491-11e5-9ffd-002590c8b7a8:1-2150706644,
f8f6d241-004d-11e7-a0ed-0cc47ae26da2:1-27999360
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:

When I run
mysql> STOP SLAVE IO_THREAD;

mysql> STOP SLAVE SQL_THREAD;

mysql> START SLAVE;

then the replication continues as usual. I hit this many times alread and it's a pain.

mysql> select version();
+---------------+
| version() |
+---------------+
| 5.7.18-15-log |
+---------------+
1 row in set (0.00 sec)

VK

Revision history for this message
Vojtech Kurka (vojtech-kurka) wrote :

When the slave hangs and I execut "STOP SLAVE", the command hangs also and I see this in error log:

2017-06-30T12:53:10.147865Z 14021 [Warning] Slave SQL for channel '': Coordinator thread of multi-threaded slave is being stopped in the middle of assigning a group of events; deferring to exit until the group completion ... , Error_code: 0
2017-06-30T12:54:12.150017Z 14021 [ERROR] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

"STOP SLAVE" never finishes. I solved it by killing all slave threads in processlist, then the slave really stops and the command returns.

Revision history for this message
Sveta Smirnova (svetasmirnova) wrote :

Thank you for the report.

I cannot repeat described behavior unless slave really has to wait SQL_Remaining_Delay seconds for updates from master. Could you please show us several outputs of SHOW SLAVE STATUS, demonstrating what SQL_Remaining_Delay number stalled?

Changed in percona-server:
status: New → Incomplete
Revision history for this message
Vojtech Kurka (vojtech-kurka) wrote :
Download full text (50.3 KiB)

The slave was running well with slave_parallel_workers=0;
Then I turned on parallel slave:

> stop slave;
> set global slave_parallel_workers=12;
> start slave;

2017-07-10T15:21:21.333891Z 737328 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'xhost_4-bin.005392' at position 1054734604, relay log '/ssd/mysql/relay/xhost_3-relay-bin.001631' position: 1052632458
2017-07-10T15:22:43.144296Z 737328 [Warning] Slave SQL for channel '': Coordinator thread of multi-threaded slave is being stopped in the middle of assigning a group of events; deferring to exit until the group completion ... , Error_code: 0
2017-07-10T15:23:45.146641Z 737328 [ERROR] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

When I execute

mysql> stop slave sql_thread; (hangs)

... the command hangs (looks like forever, it's about 7 minutes now).

====

Here are several SHOW SLAVE STATUS outputs:

*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.104
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: xhost_4-bin.005394
          Read_Master_Log_Pos: 161629001
               Relay_Log_File: xhost_3-relay-bin.001633
                Relay_Log_Pos: 451
        Relay_Master_Log_File: xhost_4-bin.005393
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table: temp.%,mysql.%
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 274
              Relay_Log_Space: 1237567965
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 1971
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 4
                  Master_UUID: a9999f14-0061-11e7-ae3f-0cc47ae26f1a
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 1800
          SQL_Remaining_Delay: 1779
      Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               M...

Changed in percona-server:
status: Incomplete → New
Revision history for this message
Nickolay Ihalainen (ihanick) wrote :

Hi Vojtech,

I've create replication setup with:
https://github.com/ihanick/mysql-replica

Created load on replication:
while true ; do for i in `seq 10` ; do mysql -pmaster_passw0rd -e "drop database if exists sbtest${i};create database sbtest${i};";sysbench --mysql-user=root --myssword=master_passw0rd --mysql-db=sbtest$i --table-size=1000000 /usr/share/sysbench/oltp_insert.lua prepare &> $i.log & echo $i; done ; wait; echo next ; done

STOP SLAVE SQL_THREAD FOR CHANNEL '';
CHANGE MASTER TO MASTER_DELAY=1800;
START SLAVE SQL_THREAD FOR CHANNEL '';

stop slave;
set global slave_parallel_workers=16;
start slave;

SQL_Remaining_Delay decreased until NULL as expected, there is no replication stale, stop slave worked on all stages (up to 7 seconds for stop).

If it's still possible to reproduce the issue, please:
Install gdb, debug info for percona server (from dbg package)
Force slave server to hang, execute stop slave.
Execute pt-pmp -k lp1701554-pmp.txt
Attach lp1701554-pmp.txt
The command will get stack trace for all threads with gdb and maybe it will be possible to find the reason of deadlock.

Or try to reduce the issue (e.g. maybe master_delay is not important to reproduce the issue and the problem happens with parallel replicaiton on each log rotation).

Changed in percona-server:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Percona Server because there has been no activity for 60 days.]

Changed in percona-server:
status: Incomplete → Expired
Revision history for this message
Vojtech Kurka (vojtech-kurka) wrote :
Download full text (4.7 KiB)

see the attachment. Is this OK? I installed "Percona-Server-57-debuginfo", bud did not restart the server then - is it needed?

warning: the debug information found in "/usr/lib/debug//usr/sbin/mysqld.debug" does not match "/usr/sbin/mysqld" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug/usr/sbin/mysqld.debug" does not match "/usr/sbin/mysqld" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug//usr/lib64/mysql/plugin/libfnv1a_udf.so.debug" does not match "/usr/lib64/mysql/plugin/libfnv1a_udf.so" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug/usr/lib64/mysql/plugin/libfnv1a_udf.so.debug" does not match "/usr/lib64/mysql/plugin/libfnv1a_udf.so" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug//usr/lib64/mysql/plugin/libfnv_udf.so.debug" does not match "/usr/lib64/mysql/plugin/libfnv_udf.so" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug/usr/lib64/mysql/plugin/libfnv_udf.so.debug" does not match "/usr/lib64/mysql/plugin/libfnv_udf.so" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug//usr/lib64/mysql/plugin/libmurmur_udf.so.debug" does not match "/usr/lib64/mysql/plugin/libmurmur_udf.so" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug/usr/lib64/mysql/plugin/libmurmur_udf.so.debug" does not match "/usr/lib64/mysql/plugin/libmurmur_udf.so" (CRC mismatch).

     16 pthread_cond_wait,pop_jobs_item,slave_worker_exec_job_group,handle_slave_worker,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
     10 __io_getevents_0_4(libaio.so.1),LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_handler,fil_aio_wait,io_handler_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      8 nanosleep(libpthread.so.0),os_thread_sleep,buf_lru_manager,start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,os_event::wait_low,buf_flush_page_cleaner_worker,start_thread(libpthread.so.0),clone(libc.so.6)
      2 pthread_cond_wait,Per_thread_connection_handler::block_until_new_connection,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      2 poll(libc.so.6),vio_io_wait,vio_socket_io_wait,vio_read,net_read_raw_loop,net_read_packet,my_net_read,Protocol_classic::read_packet,Protocol_classic::get_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 sigwait(libpthread.so.0),signal_hand,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 sigwaitinfo(libc.so.6),timer_notify_thread_func,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,os_event::wait_low,srv_purge_coordinator_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,os_event::wait_low,buf_resize_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,os_event::wait_low,buf_dump_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,compress_gtid_table,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait,os_event::wait_time_lo...

Read more...

Revision history for this message
Vojtech Kurka (vojtech-kurka) wrote :
Download full text (4.9 KiB)

I forgot to run "stop slave", so once again:

Sat Dec 16 12:18:57 CET 2017

warning: the debug information found in "/usr/lib/debug//usr/sbin/mysqld.debug" does not match "/usr/sbin/mysqld" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug/usr/sbin/mysqld.debug" does not match "/usr/sbin/mysqld" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug//usr/lib64/mysql/plugin/libfnv1a_udf.so.debug" does not match "/usr/lib64/mysql/plugin/libfnv1a_udf.so" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug/usr/lib64/mysql/plugin/libfnv1a_udf.so.debug" does not match "/usr/lib64/mysql/plugin/libfnv1a_udf.so" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug//usr/lib64/mysql/plugin/libfnv_udf.so.debug" does not match "/usr/lib64/mysql/plugin/libfnv_udf.so" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug/usr/lib64/mysql/plugin/libfnv_udf.so.debug" does not match "/usr/lib64/mysql/plugin/libfnv_udf.so" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug//usr/lib64/mysql/plugin/libmurmur_udf.so.debug" does not match "/usr/lib64/mysql/plugin/libmurmur_udf.so" (CRC mismatch).

warning: the debug information found in "/usr/lib/debug/usr/lib64/mysql/plugin/libmurmur_udf.so.debug" does not match "/usr/lib64/mysql/plugin/libmurmur_udf.so" (CRC mismatch).

     16 pthread_cond_wait,pop_jobs_item,slave_worker_exec_job_group,handle_slave_worker,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
     10 __io_getevents_0_4(libaio.so.1),LinuxAIOHandler::collect,LinuxAIOHandler::poll,os_aio_handler,fil_aio_wait,io_handler_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      8 nanosleep(libpthread.so.0),os_thread_sleep,buf_lru_manager,start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,os_event::wait_low,buf_flush_page_cleaner_worker,start_thread(libpthread.so.0),clone(libc.so.6)
      2 poll(libc.so.6),vio_io_wait,vio_socket_io_wait,vio_read,net_read_raw_loop,net_read_packet,my_net_read,Protocol_classic::read_packet,Protocol_classic::get_command,do_command,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 sigwait(libpthread.so.0),signal_hand,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 sigwaitinfo(libc.so.6),timer_notify_thread_func,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,Per_thread_connection_handler::block_until_new_connection,handle_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,os_event::wait_low,srv_purge_coordinator_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,os_event::wait_low,buf_resize_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,os_event::wait_low,buf_dump_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,compress_gtid_table,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,terminate_slave_thread[clone.part.30],terminate_slave_threads,stop_slave,stop_slave,stop_slave_...

Read more...

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-3716

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.