Cluster Failure after Online Schema Upgrade

Bug #1729592 reported by Harun Küçük on 2017-11-02
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hello,

Our whole Percona XtraDB Cluster (5.6.30-76.3-56) failed after an RSU operation. There are 5 nodes in the cluster. All the steps I took are listed below.

I was adding a new column to a table using a rolling schema upgrade. Everything was fine, and I had done the same procedure many times before.

I applied the steps below to three nodes, and all three came back fine. But on the 4th node, when I applied step 8, Node4 crashed first, and after 10-15 seconds the whole cluster failed. The logs of a successful node and of the failed node are included at the bottom.

1.) Disable the node on the load balancer and wait for active connections to drain

2.) SET GLOBAL wsrep_OSU_method='RSU';

3.) SET GLOBAL wsrep_slave_threads=1;

4.) SET GLOBAL foreign_key_checks=OFF;

5.) SET foreign_key_checks=OFF;

6.) SET GLOBAL wsrep_cluster_address="gcomm://";

7.) ALTER TABLE tablename ADD COLUMN columnname VARCHAR(4) DEFAULT '';

8.) SET GLOBAL wsrep_cluster_address="gcomm://xxx.23.yyy.101:4567,xxx.23.yyy.103:4567,xxx.23.yyy.102:4567,xxx.17.yyy.101:4567,xxx.17.yyy.102:4567";

9.) SET foreign_key_checks=ON;

10.) SET GLOBAL foreign_key_checks=ON;

11.) SET GLOBAL wsrep_slave_threads=8;

12.) SET GLOBAL wsrep_OSU_method='TOI';

13.) Enable the node on the load balancer
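The SQL portion of the steps above can be sketched as a small shell wrapper. This is a sketch only: `MYSQL_BIN`, `run_sql`, and `rsu_alter_table` are hypothetical names, the client invocation and error handling are assumptions, and only the GLOBAL variants of the settings are included because each `mysql -e` call runs in its own session (so the session-level `SET foreign_key_checks` from steps 5 and 9 would not persist anyway).

```shell
#!/bin/sh
# MYSQL_BIN is an assumption; point it at your client invocation,
# e.g. MYSQL_BIN="mysql -u root -p" before calling the function.
MYSQL_BIN="${MYSQL_BIN:-mysql}"

run_sql() {
    # Fail fast: stop the procedure if any statement errors out.
    $MYSQL_BIN -e "$1" || exit 1
}

# Hypothetical helper: perform one node's RSU ALTER, then rejoin the cluster.
# $1 = table name, $2 = new column name, $3 = gcomm:// peer list.
rsu_alter_table() {
    table="$1"; column="$2"; cluster_addr="$3"
    run_sql "SET GLOBAL wsrep_OSU_method='RSU';"
    run_sql "SET GLOBAL wsrep_slave_threads=1;"
    run_sql "SET GLOBAL foreign_key_checks=OFF;"
    run_sql "SET GLOBAL wsrep_cluster_address='gcomm://';"
    run_sql "ALTER TABLE $table ADD COLUMN $column VARCHAR(4) DEFAULT '';"
    run_sql "SET GLOBAL wsrep_cluster_address='gcomm://$cluster_addr';"
    run_sql "SET GLOBAL foreign_key_checks=ON;"
    run_sql "SET GLOBAL wsrep_slave_threads=8;"
    run_sql "SET GLOBAL wsrep_OSU_method='TOI';"
}
```

A dry run can be done by setting `MYSQL_BIN=echo`, which prints each statement instead of executing it.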

After the failure I bootstrapped the most advanced node and brought the cluster back up. The log says mysqld got signal 11, but I have no idea what caused this failure, or why the whole cluster went down instead of just the one node. Is it a bug? Maybe I am missing something. Please help.
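For reference, picking the "most advanced" node usually means comparing the saved seqno in each node's grastate.dat and bootstrapping the node with the highest value. A minimal sketch, assuming the standard Galera grastate.dat layout; `grastate_seqno` is a hypothetical helper:

```shell
#!/bin/sh
# Print the saved seqno from a Galera grastate.dat file (typically
# /var/lib/mysql/grastate.dat). A seqno of -1 means the shutdown was
# unclean; the real position must then be recovered with
# `mysqld --wsrep-recover` instead of being read from this file.
grastate_seqno() {
    awk -F: '/^seqno/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}
```

Run this on each stopped node and bootstrap the one reporting the highest seqno.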

Logs from a node where step 8 succeeded:

2017-11-01 16:12:24 3157 [Note] WSREP: GMCast version 0
2017-11-01 16:12:24 3157 [Note] WSREP: (4d5ca708, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2017-11-01 16:12:24 3157 [Note] WSREP: (4d5ca708, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2017-11-01 16:12:24 3157 [Note] WSREP: EVS version 0
2017-11-01 16:12:24 3157 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer 'xxx.23.yyy.101:4567,xxx.23.yyy.103:4567,xxx.23.yyy.102:4567,xxx.17.yyy.101:4567,xxx.17.yyy.102:4567'
2017-11-01 16:12:24 3157 [Warning] WSREP: (4d5ca708, 'tcp://0.0.0.0:4567') address 'tcp://xxx.23.yyy.103:4567' points to own listening address, blacklisting
2017-11-01 16:12:24 3157 [Note] WSREP: (4d5ca708, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2017-11-01 16:12:25 3157 [Note] WSREP: declaring 6268fd11 at tcp://xxx.23.yyy.102:4567 stable
2017-11-01 16:12:25 3157 [Note] WSREP: declaring f0346ac8 at tcp://xxx.23.yyy.101:4567 stable
2017-11-01 16:12:25 3157 [Note] WSREP: declaring f87ad44d at tcp://xxx.17.yyy.101:4567 stable
2017-11-01 16:12:25 3157 [Note] WSREP: Node 6268fd11 state prim
2017-11-01 16:12:25 3157 [Note] WSREP: view(view_id(PRIM,4d5ca708,1272) memb {
        4d5ca708,0
        6268fd11,0
        f0346ac8,0
        f87ad44d,0
} joined {
} left {
} partitioned {
})
2017-11-01 16:12:25 3157 [Note] WSREP: save pc into disk
2017-11-01 16:12:25 3157 [Note] WSREP: discarding pending addr without UUID: tcp://xxx.17.yyy.102:4567
2017-11-01 16:12:26 3157 [Note] WSREP: gcomm: connected
2017-11-01 16:12:26 3157 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2017-11-01 16:12:26 3157 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 3691602)
2017-11-01 16:12:26 3157 [Note] WSREP: Opened channel 'my_wsrep_cluster'
2017-11-01 16:12:26 3157 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 4
2017-11-01 16:12:26 3157 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 4e475f33-bf06-11e7-8fcb-7a6ef3513c53
2017-11-01 16:12:26 3157 [Note] WSREP: STATE EXCHANGE: sent state msg: 4e475f33-bf06-11e7-8fcb-7a6ef3513c53
2017-11-01 16:12:26 3157 [Note] WSREP: STATE EXCHANGE: got state msg: 4e475f33-bf06-11e7-8fcb-7a6ef3513c53 from 0 (xxx.23.yyy.103)
2017-11-01 16:12:26 3157 [Note] WSREP: STATE EXCHANGE: got state msg: 4e475f33-bf06-11e7-8fcb-7a6ef3513c53 from 1 (xxx.23.yyy.102)
2017-11-01 16:12:26 3157 [Note] WSREP: STATE EXCHANGE: got state msg: 4e475f33-bf06-11e7-8fcb-7a6ef3513c53 from 2 (xxx.23.yyy.101)
2017-11-01 16:12:26 3157 [Note] WSREP: STATE EXCHANGE: got state msg: 4e475f33-bf06-11e7-8fcb-7a6ef3513c53 from 3 (xxx.17.yyy.101)
2017-11-01 16:12:26 3157 [Note] WSREP: Quorum results:
        version = 4,
        component = PRIMARY,
        conf_id = 676,
        members = 3/4 (joined/total),
        act_id = 3691645,
        last_appl. = 3691483,
        protocols = 0/7/3 (gcs/repl/appl),
        group UUID = 8c54c159-2f9f-11e5-bf11-2bfe454f6c4c
2017-11-01 16:12:26 3157 [Note] WSREP: Flow-control interval: [32, 32]
2017-11-01 16:12:26 3157 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 3691645)
2017-11-01 16:12:26 3157 [Note] WSREP: State transfer required:
        Group state: 8c54c159-2f9f-11e5-bf11-2bfe454f6c4c:3691645
        Local state: 8c54c159-2f9f-11e5-bf11-2bfe454f6c4c:3691602
2017-11-01 16:12:26 3157 [Note] WSREP: New cluster view: global state: 8c54c159-2f9f-11e5-bf11-2bfe454f6c4c:3691645, view# 677: Primary, number of nodes: 4, my index: 0, protocol version 3
2017-11-01 16:12:26 3157 [Warning] WSREP: Gap in state sequence. Need state transfer.
2017-11-01 16:12:26 3157 [Note] WSREP: You have configured 'xtrabackup-v2' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
2017-11-01 16:12:26 3157 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2017-11-01 16:12:26 3157 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-11-01 16:12:26 3157 [Note] WSREP: Service thread queue flushed.
2017-11-01 16:12:26 3157 [Note] WSREP: Assign initial position for certification: 3691645, protocol version: 3
2017-11-01 16:12:26 3157 [Note] WSREP: Service thread queue flushed.
2017-11-01 16:12:26 3157 [Note] WSREP: IST receiver addr using tcp://xxx.23.yyy.103:4568
2017-11-01 16:12:26 3157 [Note] WSREP: Prepared IST receiver, listening at: tcp://xxx.23.yyy.103:4568
2017-11-01 16:12:26 3157 [Note] WSREP: Member 0.0 (xxx.23.yyy.103) requested state transfer from '*any*'. Selected 3.0 (xxx.17.yyy.101)(SYNCED) as donor.
2017-11-01 16:12:26 3157 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 3691647)
2017-11-01 16:12:26 3157 [Note] WSREP: Requesting state transfer: success, donor: 3
2017-11-01 16:12:26 3157 [Note] WSREP: Receiving IST: 43 writesets, seqnos 3691602-3691645
2017-11-01 16:12:26 3157 [Note] WSREP: 3.0 (xxx.17.yyy.101): State transfer to 0.0 (xxx.23.yyy.103) complete.
2017-11-01 16:12:26 3157 [Note] WSREP: Member 3.0 (xxx.17.yyy.101) synced with group.
2017-11-01 16:12:26 3157 [Note] WSREP: IST received: 8c54c159-2f9f-11e5-bf11-2bfe454f6c4c:3691645
2017-11-01 16:12:26 3157 [Note] WSREP: 0.0 (xxx.23.yyy.103): State transfer from 3.0 (xxx.17.yyy.101) complete.
2017-11-01 16:12:26 3157 [Note] WSREP: Shifting JOINER -> JOINED (TO: 3691647)
2017-11-01 16:12:26 3157 [Note] WSREP: Member 0.0 (xxx.23.yyy.103) synced with group.
2017-11-01 16:12:26 3157 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 3691647)
2017-11-01 16:12:26 3157 [Note] WSREP: Synchronized with group, ready for connections

And logs from the failed node after step 8 was applied:

2017-11-01 16:22:10 3120 [Note] WSREP: GMCast version 0
2017-11-01 16:22:10 3120 [Note] WSREP: (aa8776f3, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2017-11-01 16:22:10 3120 [Note] WSREP: (aa8776f3, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2017-11-01 16:22:10 3120 [Note] WSREP: EVS version 0
2017-11-01 16:22:10 3120 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer 'xxx.23.yyy.101:4567,xxx.23.yyy.103:4567,xxx.23.yyy.102:4567,xxx.17.yyy.101:4567,xxx.17.yyy.102:4567'
2017-11-01 16:22:10 3120 [Warning] WSREP: (aa8776f3, 'tcp://0.0.0.0:4567') address 'tcp://xxx.23.yyy.101:4567' points to own listening address, blacklisting
2017-11-01 16:22:10 3120 [Note] WSREP: (aa8776f3, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2017-11-01 16:22:11 3120 [Note] WSREP: declaring 4d5ca708 at tcp://xxx.23.yyy.103:4567 stable
2017-11-01 16:22:11 3120 [Note] WSREP: declaring 6268fd11 at tcp://xxx.23.yyy.102:4567 stable
2017-11-01 16:22:11 3120 [Note] WSREP: declaring f87ad44d at tcp://xxx.17.yyy.101:4567 stable
2017-11-01 16:22:11 3120 [Note] WSREP: Node 4d5ca708 state prim
2017-11-01 16:22:11 3120 [Note] WSREP: view(view_id(PRIM,4d5ca708,1274) memb {
        4d5ca708,0
        6268fd11,0
        aa8776f3,0
        f87ad44d,0
} joined {
} left {
} partitioned {
})
2017-11-01 16:22:11 3120 [Note] WSREP: save pc into disk
2017-11-01 16:22:11 3120 [Note] WSREP: discarding pending addr without UUID: tcp://xxx.17.yyy.102:4567
2017-11-01 16:22:11 3120 [Note] WSREP: gcomm: connected
2017-11-01 16:22:11 3120 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2017-11-01 16:22:11 3120 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 3691718)
2017-11-01 16:22:11 3120 [Note] WSREP: Opened channel 'my_wsrep_cluster'
2017-11-01 16:22:11 3120 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 4
13:22:11 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=25165824
read_buffer_size=131072
max_used_connections=21
max_threads=202
thread_count=9
connection_count=7
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 105041 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x2b5e200
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fab140a8d38 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x906e45]
/usr/sbin/mysqld(handle_fatal_signal+0x4b4)[0x66ac44]
/lib64/libpthread.so.0(+0xf7e0)[0x7fab58eda7e0]
/usr/lib64/libgalera_smm.so(_Z20gcs_group_get_statusPK9gcs_groupRN2gu6StatusE+0x1a9)[0x7fab3c29f769]
/usr/lib64/libgalera_smm.so(_Z19gcs_core_get_statusP8gcs_coreRN2gu6StatusE+0x6f)[0x7fab3c2a31df]
/usr/lib64/libgalera_smm.so(_ZN6galera13ReplicatorSMM9stats_getEv+0xafe)[0x7fab3c308a5e]
/usr/sbin/mysqld(_Z17wsrep_show_statusP3THDP17st_mysql_show_varPc+0x25)[0x59da85]
/usr/sbin/mysqld[0x73428c]
/usr/sbin/mysqld(_Z11fill_statusP3THDP10TABLE_LISTP4Item+0xd6)[0x737c46]
/usr/sbin/mysqld(_Z24get_schema_tables_resultP4JOIN23enum_schema_table_state+0x2f1)[0x723ba1]
/usr/sbin/mysqld(_ZN4JOIN14prepare_resultEPP4ListI4ItemE+0x9d)[0x717d6d]
/usr/sbin/mysqld(_ZN4JOIN4execEv+0xdf)[0x6d02ef]
/usr/sbin/mysqld(_Z12mysql_selectP3THDP10TABLE_LISTjR4ListI4ItemEPS4_P10SQL_I_ListI8st_orderESB_S7_yP13select_resultP18st_select_lex_unitP13st_select_lex+0x250)[0x71a920]
/usr/sbin/mysqld(_Z13handle_selectP3THDP13select_resultm+0x187)[0x71b1a7]
/usr/sbin/mysqld[0x6ef5ed]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0xe30)[0x6f1920]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x628)[0x6f7e18]
/usr/sbin/mysqld[0x6f7fb2]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x1896)[0x6fa1e6]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x22b)[0x6fba9b]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x17f)[0x6c216f]
/usr/sbin/mysqld(handle_one_connection+0x47)[0x6c2357]
/usr/sbin/mysqld(pfs_spawn_thread+0x12a)[0x994b6a]
/lib64/libpthread.so.0(+0x7aa1)[0x7fab58ed2aa1]
/lib64/libc.so.6(clone+0x6d)[0x7fab571c9bcd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7fa90408c2a0): is an invalid pointer
Connection ID (thread ID): 8853541
Status: NOT_KILLED

You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
171101 16:22:11 mysqld_safe Number of processes running now: 0
171101 16:22:11 mysqld_safe WSREP: not restarting wsrep node automatically
171101 16:22:11 mysqld_safe mysqld from pid file /var/lib/mysql/mysqld.pid ended

Krunal Bauskar (krunal-bauskar) wrote :

I assume you are hitting this bug: https://jira.percona.com/browse/PXC-601.
The bug was a regression introduced in 5.6.30 and was fixed in 5.6.32.

I would recommend retrying this with the latest PXC, 5.6.37. If you still hit the issue, we would be happy to investigate it further.

Harun Küçük (harunkucuk5) wrote :

Hello Krunal,

Thank you for your response. I cannot see the details of this bug:

https://jira.percona.com/browse/PXC-601.

The issue you're trying to view can't be displayed.
It may have been deleted or you don't have permission to view it right now.

Krunal Bauskar (krunal-bauskar) wrote :

That is probably because the issue was reported in JIRA a few months back, when JIRA was not yet public.

Anyway, let me point you to the said commit

commit 4392a6155b98161587a6a581b5d1f1f96a6b9f0e
Merge: 74216ed d0f5215
Author: Krunal Bauskar <email address hidden>
Date: Mon Jul 4 13:28:27 2016 +0530

    Merge pull request #67 from natsys/3.x-pxc-601

    - #PXC-601: show status while updating group state can cause an assert (likely race)

commit d0f521506ff3bce598b626fb88e710bfbae96415
Author: Julius Goryavsky <email address hidden>
Date: Wed Jun 29 21:24:07 2016 -0400

    The gcs_group_get_status function does not check that the group->my_idx can be equal to -1 (for example, when node receives self-leave message during non-primary state or technically due to implementation details of the gcs_group_handle_comp_msg function). Therefore, gcs_group_get_status function refers to memory beyond the array, which leads to SIGFAULT during execution of the to_string function.

    To fix this error we should not read desync counter until the gcs_group_handle_comp_msg function returns the group->my_idx to definite state.

    In addition, we should handle the situation where node is in the non-primary state and we got self-leave message. In this case node is desynchronized from the cluster, therefore I propose to return desync counter = 1.

Changed in percona-xtradb-cluster:
status: New → Invalid

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-2015
