Online GTID upgrade causes binlog damage under high write QPS

Bug #1516919 reported by songjiao
Affects         Status    Importance  Assigned to  Milestone
Percona Server  moved to https://jira.percona.com/projects/PS (status tracked in 5.7)
  series 5.5    Invalid   Undecided   Unassigned
  series 5.6    Triaged   High        Unassigned
  series 5.7    Invalid   Undecided   Unassigned

Bug Description

Percona version:
$ rpm -qa |grep -i percona
Percona-Server-client-56-5.6.27-rel75.0.el6.x86_64
Percona-Server-tokudb-56-5.6.27-rel75.0.el6.x86_64
Percona-Server-devel-56-5.6.27-rel75.0.el6.x86_64
Percona-Server-shared-56-5.6.27-rel75.0.el6.x86_64
Percona-Server-server-56-5.6.27-rel75.0.el6.x86_64
Percona-Server-test-56-5.6.27-rel75.0.el6.x86_64
Percona-Server-56-debuginfo-5.6.27-rel75.0.el6.x86_64

OS:
$ cat /etc/redhat-release
CentOS release 6.6 (Final)
$ uname -a
Linux 2.6.32-504.16.2.el6.x86_64 #1 SMP Wed Apr 22 06:48:29 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Server:
aws c4.4xlarge with 3.4TB gp2 ssd volume

Running 2 MySQL instances on one server: 3306 (master) <- 3307 (slave)

MySQL config file:
------------------------------------------------------------------------------------------------------------------------------------------
[mysqld_multi]
mysqld = /usr/bin/mysqld_safe
mysqladmin = /usr/bin/mysqladmin

[mysqld3306]
socket = /data/mysqldata/mysql3306/mysql3306.sock
port = 3306
pid-file = /data/mysqldata/mysql3306/bjzw3306.pid
datadir = /data/mysqldata/mysql3306
user = mysql
server-id = 1523306

gtid_mode = ON
gtid_deployment_step = ON
log_slave_updates = 1
enforce_gtid_consistency = 1

log_error = /data/mysqllog/mysql3306/yxmysql-error.log
log_bin = /data/mysqllog/mysql3306/yxmysql-bin.log
relay_log = /data/mysqllog/mysql3306/yxmysql-relay-bin.log
slow_query_log_file=/data/mysqllog/mysql3306/slow-queries.log

log_bin_trust_function_creators = 1

thread_handling = pool-of-threads
thread_pool_max_threads = 100000
thread_pool_size = 22

slave-skip-errors=1062
slave_compressed_protocol=ON

log-slave-updates
skip-slave-start

innodb_file_per_table = 1
innodb_file_format=BARRACUDA

innodb_thread_concurrency = 0
thread_cache_size = 32

innodb_io_capacity = 20000
innodb_read_io_threads = 8
innodb_write_io_threads = 8

innodb_log_file_size = 56M
innodb_log_files_in_group = 3
innodb_log_buffer_size = 8M

innodb_buffer_pool_size = 56M
innodb_buffer_pool_instances = 2

query_cache_type = OFF

long_query_time = 1
slow_query_log = ON
log-queries-not-using-indexes
log-slow-admin-statements
max_connect_errors = 200
max_connections = 10000
character_set_server = utf8mb4
character-set-client-handshake = FALSE
collation-server = utf8mb4_unicode_ci
init_connect='SET NAMES utf8mb4'
table_open_cache=10240
open_files_limit=20480
innodb_flush_log_at_trx_commit=0
innodb_open_files=10000
expire_logs_days = 7

--------------------------------------------------END-------------------------------------------------------------------------------------
3307's config is the same as 3306's, with a different server-id and port number.

How to reproduce:
1. Set up a MySQL cluster as 3306 (master) <- 3307 (slave).
2. Create a test user and table on 3306:
create user test@127.0.0.1 identified by '123456';
grant all on test.* to test@127.0.0.1;
use test;
create table seq_test(
iid bigint(20) auto_increment,
dt1 timestamp,
dt2 timestamp default CURRENT_TIMESTAMP,
primary key(iid));

3. Create a script to load data into MySQL:
$ cat /tmp/insert.py
#!/usr/bin/python
# -*- coding:utf-8 -*-

import MySQLdb
import datetime

# Positional args: host, user, passwd, db, port.
db = MySQLdb.connect('127.0.0.1', 'test', '123456', '', 3306)
cursor = db.cursor()
cnt = 0
while True:
    # One committed single-row insert per iteration to drive up write QPS.
    cursor.execute("insert into test.seq_test(dt1) values('%s');"
                   % datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    cnt += 1
    if cnt % 10000 == 0:
        print cnt
    db.commit()

4. Run 3 instances of the load script:
python /tmp/insert.py &
python /tmp/insert.py &
python /tmp/insert.py &

The write QPS is now nearly 30,000.

5. Turn off gtid_deployment_step on the master (3306):
SET GLOBAL gtid_deployment_step = OFF;

6. Executing "show slave status\G" on 3307 shows:
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master; the first event 'yxmysql-bin.000001' at 151, the last event read from '/data/mysqllog/mysql3306/yxmysql-bin.000001' at 464889689, the last byte read from '/data/mysqllog/mysql3306/yxmysql-bin.000001' at 464889708.'

This is because the binlog file on the master was damaged while GTID was being turned on.

I have tried 3 times; the first time was OK, but the last 2 times hit the binlog damage problem.

Revision history for this message
Sveta Smirnova (svetasmirnova) wrote :

Thank you for the report.

> 5、turn off gtid_deployment_step in master (3306)
> SET GLOBAL gtid_deployment_step = OFF;

But according to https://www.percona.com/doc/percona-server/5.6/flexibility/online_gtid_deployment.html, gtid_deployment_step should be ON on the slave only. This does not seem to be a bug to me.

Changed in percona-server:
status: New → Invalid
Revision history for this message
songjiao (songjiao001) wrote :

Actually, 3306 was a slave before; I chose 3306 as the new master. The other slaves have gtid_deployment_step=ON.
When I turned 3306's gtid_deployment_step=OFF, with 30,000 inserts per second, it caused binlog damage.

Revision history for this message
Jeremy Tinley (techwolf359) wrote :

We had the same problem in a high write situation. We have master<->master with each master having 2 replicas. We point writes at A, enable gtid, deployment step and consistency on all 5 replicas and restart. Then repoint writes to B and bounce A to get them all in sync with changes.

Then on the B master, we disable deployment step and every replica of B stopped with binlog corruption. The last event written was the SET @@SESSION.GTID_NEXT event.

Changed in percona-server:
status: Invalid → New
Revision history for this message
Jeremy Tinley (techwolf359) wrote :

We're running PS 5.6.34 and experienced this problem today.

Revision history for this message
Ernie Souhrada (denshikarasu) wrote :

This happened to us last week on a server that wasn't even doing 1000 writes per second. Amazon AWS i2.2xlarge with SSDs in RAID0. Percona Server 5.6.33.

Revision history for this message
Jeremy Tinley (techwolf359) wrote :

We're considering LOCK BINLOG FOR BACKUP as a possible workaround to delay writes while deployment step is disabled. If that is unsuccessful, we may have to go to read_only while we disable deployment step.
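A sketch of that workaround, assuming Percona Server's backup locks (LOCK BINLOG FOR BACKUP / UNLOCK BINLOG from the PS 5.6 backup-locks feature) are available; this is a hypothetical sequence, not tested against this bug:

```sql
-- Hold off binlog writes while the variable is flipped (untested sketch):
LOCK BINLOG FOR BACKUP;
SET GLOBAL gtid_deployment_step = OFF;
UNLOCK BINLOG;
```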

Revision history for this message
Sveta Smirnova (svetasmirnova) wrote :

I managed to crash a debug server with the following backtrace:

Thread 1 (Thread 0x7fc8c061c700 (LWP 5229)):
#0 __pthread_kill (threadid=<optimized out>, signo=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:62
#1 0x0000000000b14459 in my_write_core (sig=6) at /home/sveta/src/percona-server/mysys/stacktrace.c:424
#2 0x000000000074e61d in handle_fatal_signal (sig=6) at /home/sveta/src/percona-server/sql/signal_handler.cc:236
#3 <signal handler called>
#4 0x00007fc8ccbc8428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#5 0x00007fc8ccbca02a in __GI_abort () at abort.c:89
#6 0x00007fc8ccbc0bd7 in __assert_fail_base (fmt=<optimized out>,
    assertion=assertion@entry=0xfa7668 "! is_set() || m_can_overwrite_status",
    file=file@entry=0xfa7458 "/home/sveta/src/percona-server/sql/sql_error.cc", line=line@entry=521,
    function=function@entry=0xfa7a00 <Diagnostics_area::set_error_status(unsigned int, char const*, char const*, Sql_condition const*)::__PRETTY_FUNCTION__> "void Diagnostics_area::set_error_status(uint, const char*, const char*, const Sql_condition*)") at assert.c:92
#7 0x00007fc8ccbc0c82 in __GI___assert_fail (assertion=0xfa7668 "! is_set() || m_can_overwrite_status",
    file=0xfa7458 "/home/sveta/src/percona-server/sql/sql_error.cc", line=521,
    function=0xfa7a00 <Diagnostics_area::set_error_status(unsigned int, char const*, char const*, Sql_condition const*)::__PRETTY_FUNCTION__> "void Diagnostics_area::set_error_status(uint, const char*, const char*, const Sql_condition*)") at assert.c:101
#8 0x00000000007d7121 in Diagnostics_area::set_error_status (this=0x7fc884004000, sql_errno=1837,
    message=0x7fc8c061b1f0 "When @@SESSION.GTID_NEXT is set to a GTID, you must explicitly set it to a different value after a COMMIT or ROLLBACK. Please check GTID_NEXT variable manual page for detailed explanation. Current @@S"..., sqlstate=0xfc14e1 "HY000", error_condition=0x0) at /home/sveta/src/percona-server/sql/sql_error.cc:521
#9 0x00000000007bc21f in THD::raise_condition (this=0x7fc884000a50, sql_errno=1837, sqlstate=0xfc14e1 "HY000",
    level=Sql_condition::WARN_LEVEL_ERROR,
    msg=0x7fc8c061b1f0 "When @@SESSION.GTID_NEXT is set to a GTID, you must explicitly set it to a different value after a COMMIT or ROLLBACK. Please check GTID_NEXT variable manual page for detailed explanation. Current @@S"...)
    at /home/sveta/src/percona-server/sql/sql_class.cc:1421
#10 0x00000000006363ae in my_message_sql (error=1837,
---Type <return> to continue, or q <return> to quit---
    str=0x7fc8c061b1f0 "When @@SESSION.GTID_NEXT is set to a GTID, you must explicitly set it to a different value after a COMMIT or ROLLBACK. Please check GTID_NEXT variable manual page for detailed explanation. Current @@S"..., MyFlags=0) at /home/sveta/src/percona-server/sql/mysqld.cc:3613
#11 0x0000000000b0a023 in my_error (nr=1837, MyFlags=0) at /home/sveta/src/percona-server/mysys/my_error.c:185
#12 0x0000000000a599e4 in gtid_pre_statement_checks (thd=0x7fc884000a50)
    at /home/sveta/src/percona-server/sql/rpl_gtid_execution.cc:360
#13 0x0000000000a75beb in Rows_log_event::do_apply_event (this=0x7fc88400e2c0, rli=0x35e95c0)
  ...


Revision history for this message
Sveta Smirnova (svetasmirnova) wrote :
Revision history for this message
Sveta Smirnova (svetasmirnova) wrote :

I see two issues here.

1. If the variable gtid_deployment_step is changed dynamically, the master may generate an invalid binary log like the extract attached.

This may be considered a feature request, but it still may be easy to fix.

I expect that ensuring all writes to the master have stopped before changing this variable would resolve the issue. If you also flush the binary log before making the change, I see no reason for the new binary log to be corrupted.

2. Slave SQL thread crash when it tries to apply such binary log.

If I try to apply it to the master using the SQL interface, there is no crash:

mysqlmtr -P13001 test < bug1516919-1.sql
ERROR 1837 (HY000) at line 67: When @@SESSION.GTID_NEXT is set to a GTID, you must explicitly set it to a different value after a COMMIT or ROLLBACK. Please check GTID_NEXT variable manual page for detailed explanation. Current @@SESSION.GTID_NEXT is 'ac2dfc05-0fc7-11e7-bf16-30b5c2208a0f:480780'.

Bug confirmed.
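The mitigation suggested for issue 1 above could be sketched as follows (a hypothetical sequence; it assumes application writes can be paused via read_only and that no SUPER clients keep writing):

```sql
-- Quiesce writes, rotate the binlog, then flip the variable (untested sketch):
SET GLOBAL read_only = ON;              -- pause non-SUPER application writes
FLUSH BINARY LOGS;                      -- start a fresh binlog file
SET GLOBAL gtid_deployment_step = OFF;
SET GLOBAL read_only = OFF;
```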

Changed in percona-server:
status: New → Confirmed
Revision history for this message
Jeremy Tinley (techwolf359) wrote :

Not to be confrontational but stopping writes to the master isn't really an online rollout. :)

tags: added: gtid-deployment-step
Revision history for this message
Keyur (keyurdg) wrote :
Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

I would like your feedback as for the best way to fix this bug.

Our documented online GTID deployment procedure (https://www.percona.com/doc/percona-server/5.6/flexibility/online_gtid_deployment.html) never permits local workload on a node with gtid_deployment_step = ON. On slaves, when it is enabled, only replicated master workload is running, and the variable is reset to OFF on the new master promotion.

Based on this, it seems to make sense to have the server behave as if under (super-)read-only while gtid_deployment_step is set, as part of fixing this bug, in addition to other fixes.

Do you agree or am I missing some intended use scenarios?

Revision history for this message
Jeremy Tinley (techwolf359) wrote :

The PR above seems to solve the corruption issue for us but I'd like a second pair of eyes on it to make sure it seems like the right approach.

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

The PR does indeed fix at least one of the issues, but a complete fix must consider my question above too. For example, right now, local workload on a gtid_deployment_step = ON master not only results in binlog corruption, but the master itself also runs as if binlogging were disabled for the session.

Revision history for this message
Keyur (keyurdg) wrote :

I added another commit to fix the variable not "taking" upon server start-up. https://github.com/percona/percona-server/pull/1554/commits/bdb3778bce0270acac141b4c3e56136b8baa030d

Is that the issue you were describing, Laurynas?

Revision history for this message
Keyur (keyurdg) wrote :

What this bugfix is missing is inserting a "SET @@SESSION.GTID_NEXT = 'AUTOMATIC'" into the master binlog stream after gtid_deployment_step is flipped from OFF to ON on a master that's taking write workload.

The lack of the "reset" GTID_NEXT statement manifests in gtid_next on the slave executor THD being "undefined" and stopping the thread. That said, "start slave;" fixes the issue and no statements are dropped.

I think documenting this behavior should suffice.
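Per the two comments above, the missing reset statement and the operator-side recovery would look roughly like this (a sketch of what the comments describe, not a committed fix):

```sql
-- What the fix would inject into the binlog stream after the flip:
SET @@SESSION.GTID_NEXT = 'AUTOMATIC';

-- On a slave whose SQL thread stopped with gtid_next 'undefined',
-- restarting replication resumes without dropping statements:
START SLAVE;
```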

Revision history for this message
Laurynas Biveinis (laurynas-biveinis) wrote :

The startup variable setting issue needs to be fixed too, but the one I was talking about before is the master with gtid_deployment_step=ON not binlogging its own workload, which seems related to your latter comment.

Revision history for this message
Jeremy Tinley (techwolf359) wrote : Re: [Bug 1516919] Re: online upgrade GTID cause binlog damage in high write QPS situation

With deployment step on, the master does binlog correctly. All my tests show this works fine.


Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-3325
