MSR replication - Galera clash
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC) | New | Undecided | Unassigned |
5.7 | Fix Committed | Undecided | Unassigned |
Bug Description
Setting up multi-source replication on a PXC node results in the SQL thread of the second channel getting permanently stuck in the 'System lock' state.
Test case:
* set up two standalone MySQL 5.7 instances
* set up one standalone PXC 5.7 node, with the wsrep provider enabled
* set up replication from both standalone MySQL instances to the PXC node, using two channels
* restart the slave PXC node
The slave node becomes permanently blocked: it cannot be stopped gracefully, and the locked SQL thread cannot be killed.
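For reference, a minimal sketch of how the two channels could be configured on the PXC node. Host names, ports and credentials are placeholders rather than values from this report, and GTID auto-positioning is assumed; the channel names match the ones used further down:

mysql> CHANGE MASTER TO MASTER_HOST='mysql-a.example.com', MASTER_PORT=3306,
    ->   MASTER_USER='repl', MASTER_PASSWORD='***', MASTER_AUTO_POSITION=1
    ->   FOR CHANNEL 'c3-c1';
mysql> CHANGE MASTER TO MASTER_HOST='mysql-b.example.com', MASTER_PORT=3306,
    ->   MASTER_USER='repl', MASTER_PASSWORD='***', MASTER_AUTO_POSITION=1
    ->   FOR CHANNEL 'c3-c2';
mysql> START SLAVE FOR CHANNEL 'c3-c1';
mysql> START SLAVE FOR CHANNEL 'c3-c2';

Note that multi-source replication on 5.7 also requires master_info_repository=TABLE and relay_log_info_repository=TABLE on the slave.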
Example results:
mysql> pager egrep "Running|   (pager pattern truncated in the original report)
mysql> show slave status\G
... (filtered output truncated in the original report; it showed the Slave_*_Running and Last_* fields for both channels)
2 rows in set (0.00 sec)
mysql> show processlist;
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
| 1 | system user | | NULL | Sleep | 617 | wsrep: applier idle | NULL | 0 | 0 |
| 2 | system user | | NULL | Sleep | 617 | wsrep: aborter idle | NULL | 0 | 0 |
| 3 | system user | | NULL | Connect | 616 | Waiting for master to send event | NULL | 0 | 0 |
| 4 | system user | | NULL | Connect | 616 | Slave has read all relay log; waiting for more updates | NULL | 0 | 0 |
| 5 | system user | | NULL | Connect | 616 | Waiting for master to send event | NULL | 0 | 0 |
| 6 | system user | | NULL | Connect | 616 | System lock | NULL | 0 | 0 |
| 9 | root | localhost | NULL | Query | 0 | starting | show processlist | 0 | 0 |
7 rows in set (0.00 sec)
mysql> SELECT * FROM performance_    (query truncated in the original report)
... (vertical output truncated; two rows were returned, showing COUNT_RECEIVED_..., LAST_HEARTBEAT..., RECEIVED_... and LAST_... columns)
2 rows in set (0.00 sec)
mysql> SELECT * FROM performance_    (query truncated in the original report)
... (vertical output truncated; two rows were returned, showing PROCESSLIST_..., PARENT_... and CONNECTION_... columns)
2 rows in set (0.00 sec)
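Since the performance_schema output above is truncated, an equivalent per-channel check on 5.7 would look roughly like the following; these are the standard replication tables, not necessarily the exact queries that were run:

mysql> SELECT CHANNEL_NAME, SERVICE_STATE, LAST_ERROR_MESSAGE
    ->   FROM performance_schema.replication_connection_status;
mysql> SELECT CHANNEL_NAME, SERVICE_STATE
    ->   FROM performance_schema.replication_applier_status;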
mysql> show status like 'ws%';
...
| wsrep_local_...      |
...
| wsrep_cluster_...    |
...
| wsrep_provider_...   |
| wsrep_ready | ON |
...
(most variable names and values above are truncated in the original report)
60 rows in set (0.00 sec)
mysql> stop slave for channel 'c3-c1';
Query OK, 0 rows affected (0.00 sec)
mysql> stop slave for channel 'c3-c2';
... hangs
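The stuck channel can also be inspected on its own with the FOR CHANNEL form of SHOW SLAVE STATUS (standard 5.7 syntax, shown here as a suggestion rather than something taken from the report):

mysql> SHOW SLAVE STATUS FOR CHANNEL 'c3-c2'\G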
Kenn Takara (kenn-takara) wrote (#1):
Tried to repro this on 5.7.16 and it looks ok. Are the repl users on the channels the same or different?
Kenn Takara (kenn-takara) wrote (#2):
I've noticed a possible workaround:
(1) start the node with 'skip-slave-start' in the config
(2) run "START SLAVE"
This appears to start the slave threads without issues.
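A minimal sketch of that workaround, assuming the usual my.cnf layout (paths and comments are illustrative, not from the report):

# my.cnf on the PXC node
[mysqld]
skip-slave-start        # do not start the replication threads automatically at startup

mysql> START SLAVE;     -- run manually once the node is up; starts all configured channels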
Krunal Bauskar (krunal-bauskar) wrote (#3):
Przemek,
Can you check the comments from Kenn?
Kenn Takara (kenn-takara) wrote (#4):
This is due to a bug in the startup code. The two slave threads are waiting on a condition variable for wsrep_ready. However, when wsrep_ready changes, the code uses mysql_cond_signal(), which wakes up a SINGLE thread. So one of the slave threads is signalled and continues, while the other keeps waiting on the condition variable, which will never be signalled again.
The solution is to use mysql_cond_broadcast() instead, so that all threads waiting on the condition variable are woken up.
Shahriyar Rzayev (rzayev-sehriyar) wrote (#5):
Percona now uses JIRA for bug reports, so this bug report has been migrated to: https:/