PXC node as async slave has issues updating exec_master_log_pos and relay_master_log_file

Bug #1169247 reported by Jay Janssen
This bug affects 1 person
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC); status tracked in 5.6

  Series  Status   Importance  Assigned to  Milestone
  5.5     Invalid  Undecided   Unassigned   -
  5.6     Invalid  Undecided   Unassigned   -

Bug Description

This is the second PXC customer I've seen this on; it is clearly a PXC/codership-mysql bug:

So, basically I have a 3-node cluster with one node acting as an async slave of some other master. In this case the master is Percona Server 5.5.24, and the PXC cluster is Percona-XtraDB-Cluster-server-5.5.29-23.7.2.389.rhel6.x86_64.

The symptoms are:

- slave threads are clearly running
  - replication traffic on cluster confirms this
  - pt-heartbeat from the async master is updating (see the invocation sketch below)
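
For reference, a minimal sketch of the pt-heartbeat pair used for this check (standard pt-heartbeat options; the percona database matches this report, but the hosts and credentials are placeholders):

# On the async master: write a timestamp row to percona.heartbeat every second
pt-heartbeat --update --database percona --create-table h=localhost,u=repl

# On the PXC node acting as slave: report delay based on the replicated row
pt-heartbeat --monitor --database percona h=localhost,u=repl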

However, these variables in SHOW SLAVE STATUS get stuck at some point and stop updating. This doesn't happen immediately after starting the slave:

mysql> select * from percona.heartbeat; show slave status\G select sleep( 10 ); select * from percona.heartbeat; show slave status\G
+----------------------------+-----------+------------+-----------+-----------------------+---------------------+
| ts                         | server_id | file       | position  | relay_master_log_file | exec_master_log_pos |
+----------------------------+-----------+------------+-----------+-----------------------+---------------------+
| 2013-04-15T07:35:57.000960 |         1 | bin.046443 | 975663927 | NULL                  |                NULL |
+----------------------------+-----------+------------+-----------+-----------------------+---------------------+
1 row in set (0.00 sec)

*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: tinadb
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: bin.046460
          Read_Master_Log_Pos: 31022640
               Relay_Log_File: tabasco-relay-bin.000015
                Relay_Log_Pos: 55082221
        Relay_Master_Log_File: bin.046431
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
          Exec_Master_Log_Pos: 55082081

+-------------+
| sleep( 10 ) |
+-------------+
|           0 |
+-------------+
1 row in set (10.03 sec)

+----------------------------+-----------+------------+------------+-----------------------+---------------------+
| ts                         | server_id | file       | position   | relay_master_log_file | exec_master_log_pos |
+----------------------------+-----------+------------+------------+-----------------------+---------------------+
| 2013-04-15T07:36:39.000850 |         1 | bin.046443 | 1011229262 | NULL                  |                NULL |
+----------------------------+-----------+------------+------------+-----------------------+---------------------+
1 row in set (0.00 sec)

*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: tinadb
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: bin.046460
          Read_Master_Log_Pos: 49417517
               Relay_Log_File: tabasco-relay-bin.000015
                Relay_Log_Pos: 55082221
        Relay_Master_Log_File: bin.046431
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
          Exec_Master_Log_Pos: 55082081
              Relay_Log_Space: 18829253023
             Master_Server_Id: 1
1 row in set (0.00 sec)

Clearly the heartbeat table (generated by pt-heartbeat on the master) is updating, but these SHOW SLAVE STATUS fields are not:
- Relay_Master_Log_File
- Exec_Master_Log_Pos
- Relay_Log_File
- Relay_Log_Pos

Additionally, while it's very unlikely that the SQL thread is really still on Relay_Log_File tabasco-relay-bin.000015, that file is still in the datadir -- the replication threads are not rotating it.
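
A quick way to confirm the symptom (a shell sketch, assuming pt-heartbeat is writing percona.heartbeat and a local mysql client with credentials configured; nothing here is from the original report):

# Sample the heartbeat and the reported execution position twice, 10s apart.
pos1=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Exec_Master_Log_Pos/ {print $2}')
hb1=$(mysql -NBe "SELECT MAX(ts) FROM percona.heartbeat")
sleep 10
pos2=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Exec_Master_Log_Pos/ {print $2}')
hb2=$(mysql -NBe "SELECT MAX(ts) FROM percona.heartbeat")
# A heartbeat that advances while the executed position stays frozen matches this bug.
[ "$hb1" != "$hb2" ] && [ "$pos1" = "$pos2" ] && echo "Exec_Master_Log_Pos appears stuck"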

I've seen this twice now on real production servers. Both times there was a big replication lag to catch up from.

Revision history for this message
Jay Janssen (jay-janssen) wrote :

This seems similar to: https://bugs.launchpad.net/percona-server/+bug/860910

Though that was fixed in PS 5.5.17.

Revision history for this message
Jay Janssen (jay-janssen) wrote :

This is at least somewhat related to using wsrep_mysql_replication_bundle.
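
For anyone correlating this: bundling is controlled by a single global option. A my.cnf sketch (the value 500 is an arbitrary illustration, not taken from the affected servers):

# On the PXC node acting as async slave
[mysqld]
# Group this many replication events into a single transaction when applying
# from the async master; the default of 0 disables bundling.
wsrep_mysql_replication_bundle = 500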

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Is this related to lp:1169329 and lp:1169326, so that it can be combined with them?

Revision history for this message
Nilnandan Joshi (nilnandan-joshi) wrote :

I was unable to reproduce this with PXC 5.5.37 or PXC 5.6.19. I created an async master and a 3-node cluster with one node acting as an async slave of that master. I tried multiple times to generate insert load, including with sysbench, but never saw exec_master_log_pos or relay_master_log_file stop updating. It works fine.
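
For reference, a sketch of how the async leg was set up (host, credentials, and log coordinates below are placeholders, not values from my test):

-- On exactly one PXC node:
CHANGE MASTER TO
  MASTER_HOST='master-host',
  MASTER_USER='repl',
  MASTER_PASSWORD='repl-password',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=4;
START SLAVE;
-- Then drive write load on the master and watch SHOW SLAVE STATUS here.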

[root@percona-pxc55-1 nilnandan]# sysbench --test=oltp --oltp-table-size=1000000 --mysql-db=dbtest --mysql-user=msandbox --mysql-password=msandbox --mysql-socket=/tmp/mysql_sandbox5537.sock --db-driver=mysql --oltp-auto-inc=off prepare
sysbench 0.4.12: multi-threaded system evaluation benchmark
Creating table 'sbtest'...
Creating 1000000 records in table 'sbtest'...
[root@percona-pxc55-1 nilnandan]#

[root@percona-pxc55-1 nilnandan]# sysbench --test=oltp --oltp-table-size=1000000 --oltp-test-mode=complex --oltp-read-only=off --num-threads=100 --max-time=180 --max-requests=0 --mysql-db=dbtest --mysql-user=msandbox --mysql-password=msandbox --mysql-socket=/tmp/mysql_sandbox5537.sock --db-driver=mysql run
sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 100

Doing OLTP test.
Running mixed OLTP test
Using Special distribution (12 iterations, 1 pct of values are returned in 75 pct cases)
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Threads started!
Time limit exceeded, exiting...
(last message repeated 99 times)
Done.

OLTP test statistics:
    queries performed:
        read:                            750792
        write:                           268140
        other:                           107256
        total:                           1126188
    transactions:                        53628  (296.96 per sec.)
    deadlocks:                           0      (0.00 per sec.)
    read/write requests:                 1018932 (5642.32 per sec.)
    other operations:                    107256 (593.93 per sec.)

Test execution summary:
    total time:                          180.5873s
    total number of events:              53628
    total time taken by event execution: 17995.9303
    per-request statistics:
         min:                                 33.33ms
         avg:                                335.57ms
         max:                               7037.34ms
         approx.  95 percentile:            536.68ms

Threads fairness:
    events (avg/stddev):           536.2800/6.95
    execution time (avg/stddev):   179.9593/0.13

[root@percona-pxc55-1 nilnandan]#

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1334
