MSR replication - Galera clash

Bug #1676464 reported by Przemek on 2017-03-27
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
  Status: New, Importance: Undecided, Assigned to: Unassigned
Affects: Percona XtraDB Cluster 5.7
  Status: Fix Committed, Importance: Undecided, Assigned to: Unassigned

Bug Description

Setting up multi-source replication on a PXC node results in a permanent 'System lock' state for the second channel's SQL thread.

Test case:
* set up 2 standalone MySQL 5.7 instances
* set up 1 standalone PXC 5.7 node, with the wsrep provider enabled
* set up replication from both standalone MySQL instances, using two channels, to the PXC node
* restart the slave PXC node
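The two channels in the last step can be configured with statements along these lines (host names and credentials are placeholders, not taken from the report):

```sql
-- On the PXC node; source hosts and the repl user are hypothetical
CHANGE MASTER TO MASTER_HOST='mysql1', MASTER_USER='repl',
  MASTER_PASSWORD='replpass', MASTER_AUTO_POSITION=1
  FOR CHANNEL 'c3-c1';
CHANGE MASTER TO MASTER_HOST='mysql2', MASTER_USER='repl',
  MASTER_PASSWORD='replpass', MASTER_AUTO_POSITION=1
  FOR CHANNEL 'c3-c2';
START SLAVE FOR CHANNEL 'c3-c1';
START SLAVE FOR CHANNEL 'c3-c2';
```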

The slave node gets permanently blocked: it cannot be stopped gracefully, nor can the locked SQL thread be killed.

Example results:
mysql> pager egrep "Running|Channel|Err"
PAGER set to 'egrep "Running|Channel|Err"'
mysql> show slave status\G
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                   Last_Errno: 0
                   Last_Error:
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
                 Channel_Name: c3-c1
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                   Last_Errno: 0
                   Last_Error:
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
      Slave_SQL_Running_State: System lock
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
                 Channel_Name: c3-c2
2 rows in set (0.00 sec)

mysql> show processlist;
+----+-------------+-----------+------+---------+------+--------------------------------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+----+-------------+-----------+------+---------+------+--------------------------------------------------------+------------------+-----------+---------------+
| 1 | system user | | NULL | Sleep | 617 | wsrep: applier idle | NULL | 0 | 0 |
| 2 | system user | | NULL | Sleep | 617 | wsrep: aborter idle | NULL | 0 | 0 |
| 3 | system user | | NULL | Connect | 616 | Waiting for master to send event | NULL | 0 | 0 |
| 4 | system user | | NULL | Connect | 616 | Slave has read all relay log; waiting for more updates | NULL | 0 | 0 |
| 5 | system user | | NULL | Connect | 616 | Waiting for master to send event | NULL | 0 | 0 |
| 6 | system user | | NULL | Connect | 616 | System lock | NULL | 0 | 0 |
| 9 | root | localhost | NULL | Query | 0 | starting | show processlist | 0 | 0 |
+----+-------------+-----------+------+---------+------+--------------------------------------------------------+------------------+-----------+---------------+
7 rows in set (0.00 sec)

mysql> SELECT * FROM performance_schema.replication_connection_status\G
*************************** 1. row ***************************
             CHANNEL_NAME: c3-c1
               GROUP_NAME:
              SOURCE_UUID: f257ee64-0e06-11e7-b890-0242ac110002
                THREAD_ID: 55
            SERVICE_STATE: ON
COUNT_RECEIVED_HEARTBEATS: 12
 LAST_HEARTBEAT_TIMESTAMP: 2017-03-27 14:18:26
 RECEIVED_TRANSACTION_SET:
        LAST_ERROR_NUMBER: 0
       LAST_ERROR_MESSAGE:
     LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 2. row ***************************
             CHANNEL_NAME: c3-c2
               GROUP_NAME:
              SOURCE_UUID: e893fd74-1090-11e7-9df2-0242ac150002
                THREAD_ID: 57
            SERVICE_STATE: ON
COUNT_RECEIVED_HEARTBEATS: 12
 LAST_HEARTBEAT_TIMESTAMP: 2017-03-27 14:18:26
 RECEIVED_TRANSACTION_SET: 125ad4e4-ef6f-ee18-6575-649192f28012:1-2
        LAST_ERROR_NUMBER: 0
       LAST_ERROR_MESSAGE:
     LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
2 rows in set (0.00 sec)

mysql> SELECT * FROM performance_schema.threads where name like '%slave_sql'\G
*************************** 1. row ***************************
          THREAD_ID: 56
               NAME: thread/sql/slave_sql
               TYPE: FOREGROUND
     PROCESSLIST_ID: 4
   PROCESSLIST_USER: NULL
   PROCESSLIST_HOST: NULL
     PROCESSLIST_DB: NULL
PROCESSLIST_COMMAND: Connect
   PROCESSLIST_TIME: 589
  PROCESSLIST_STATE: Slave has read all relay log; waiting for more updates
   PROCESSLIST_INFO: NULL
   PARENT_THREAD_ID: 1
               ROLE: NULL
       INSTRUMENTED: YES
            HISTORY: YES
    CONNECTION_TYPE: NULL
       THREAD_OS_ID: 16813
*************************** 2. row ***************************
          THREAD_ID: 58
               NAME: thread/sql/slave_sql
               TYPE: FOREGROUND
     PROCESSLIST_ID: 6
   PROCESSLIST_USER: NULL
   PROCESSLIST_HOST: NULL
     PROCESSLIST_DB: NULL
PROCESSLIST_COMMAND: Connect
   PROCESSLIST_TIME: 589
  PROCESSLIST_STATE: NULL
   PROCESSLIST_INFO: NULL
   PARENT_THREAD_ID: 1
               ROLE: NULL
       INSTRUMENTED: YES
            HISTORY: YES
    CONNECTION_TYPE: NULL
       THREAD_OS_ID: 16815
2 rows in set (0.00 sec)

mysql> show status like 'ws%';
...
| wsrep_local_state_comment | Synced |
...
| wsrep_cluster_status | Primary |
...
| wsrep_provider_version | 3.20(r7e383f7) |
| wsrep_ready | ON |
+------------------------------+--------------------------------------+
60 rows in set (0.00 sec)

mysql> stop slave for channel 'c3-c1';
Query OK, 0 rows affected (0.00 sec)

mysql> stop slave for channel 'c3-c2';
... hangs

Kenn Takara (kenn-takara) wrote :

Tried to reproduce this on 5.7.16 and it looks OK. Are the replication users on the channels the same or different?

Kenn Takara (kenn-takara) wrote :

I've noticed a possible workaround:
(1) start the node with 'skip-slave-start' in the config
(2) run "START SLAVE"
This appears to start the slave threads with no issues.
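Concretely, the workaround amounts to something like the following (file path and section are the usual MySQL defaults, assumed here):

```
# /etc/my.cnf on the PXC node
[mysqld]
skip-slave-start
```

After restarting mysqld with this option, run START SLAVE (or START SLAVE FOR CHANNEL '...' per channel) manually once the node reports Synced.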

Krunal Bauskar (krunal-bauskar) wrote :

Przemek,

Can you check Kenn's comments?

Kenn Takara (kenn-takara) wrote :

This is due to a bug in the startup code. The two slave SQL threads wait on a condition variable for wsrep_ready. However, when wsrep_ready changes, the code uses mysql_cond_signal(), which wakes up a SINGLE thread. So one slave thread is signalled and continues, while the other remains waiting on the condition variable, which will never be signalled again.

The solution is to use mysql_cond_broadcast() to wake up all threads waiting on the wsrep_ready condition variable.
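The signal-vs-broadcast distinction can be sketched with Python's threading.Condition as a stand-in for mysql_cond_signal()/mysql_cond_broadcast(); the names and channel labels below are illustrative, not PXC source:

```python
import threading
import time

wsrep_ready = False
cond = threading.Condition()
woken = []  # records which "slave SQL threads" got past the wait

def slave_sql_thread(channel):
    # Each channel's applier blocks until wsrep_ready becomes true.
    with cond:
        while not wsrep_ready:
            cond.wait()
        woken.append(channel)

t1 = threading.Thread(target=slave_sql_thread, args=("c3-c1",))
t2 = threading.Thread(target=slave_sql_thread, args=("c3-c2",))
t1.start()
t2.start()
time.sleep(0.5)  # let both threads block on the condition variable

with cond:
    wsrep_ready = True
    # Buggy behaviour: cond.notify() (mysql_cond_signal) wakes only ONE
    # waiter, leaving the other channel stuck in "System lock" forever.
    # The fix (mysql_cond_broadcast) corresponds to notify_all():
    cond.notify_all()

t1.join()
t2.join()
print(sorted(woken))  # → ['c3-c1', 'c3-c2']: both channels proceed
```

With notify() instead of notify_all(), one of the two join() calls would never return, mirroring the hung second channel in the report.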

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-826
