Rejoining/recovery of galera cluster fails

Bug #1590166 reported by Paul Halmos on 2016-06-07
This bug affects 2 people
Affects: openstack-ansible (status tracked in Trunk)

Series    Importance   Assigned to
Liberty   Undecided    Darren Birkett
Mitaka    Undecided    Jesse Pretorius
Trunk     High         Jesse Pretorius

Bug Description

An infra node was down for about 2 days due to a hardware failure. After the node was brought back up, it failed to rejoin the cluster with the errors shown in [1]. This appears to be a bug in percona-xtrabackup-2.3 with MariaDB 10 [0]. This is a critical failure, because the only way to rejoin the cluster is to change "wsrep_sst_method = xtrabackup-v2" to either "rsync" or "mysqldump", neither of which is a good option due to speed or DB locking.
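
For reference, the workaround mentioned above amounts to a one-line change in the Galera configuration. The sketch below prints a copy of a config file with wsrep_sst_method switched to another value. The function name with_sst_method is hypothetical, and the sketch assumes the setting sits on its own line in "key = value" form.

```shell
# Print FILE to stdout with wsrep_sst_method replaced by METHOD.
# The original file is left untouched so the change can be reviewed first.
with_sst_method() {
    file=$1
    method=$2
    sed "s/^[[:space:]]*wsrep_sst_method[[:space:]]*=.*/wsrep_sst_method = ${method}/" "$file"
}
```

A call such as `with_sst_method /path/to/cluster.cnf rsync > cluster.cnf.new` (the path is a placeholder) produces a candidate file to diff against the original before restarting the node.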

[0]
https://bugs.launchpad.net/percona-xtrabackup/+bug/1570560

[1]

InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
2016-06-07 16:47:12 7efea54c1700 InnoDB: Assertion failure in thread 139632160020224 in file page0cur.cc line 931
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
20:47:12 UTC - xtrabackup got signal 6 ;
This could be because you hit a bug or data is corrupted.
This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x10000
innobackupex(my_print_stacktrace+0x2e) [0x8f139e]
innobackupex(handle_fatal_signal+0x262) [0x7d1c82]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7efea87a3340]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7efea6f07cc9]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7efea6f0b0d8]
innobackupex(page_cur_parse_insert_rec(unsigned long, unsigned char*, unsigned char*, buf_block_t*, dict_index_t*, mtr_t*)+0x8d5) [0x6db465]
innobackupex() [0x68249c]
innobackupex(recv_recover_page_func(unsigned long, buf_block_t*)+0x509) [0x684009]
innobackupex(buf_page_io_complete(buf_page_t*)+0x410) [0x6c1560]
innobackupex(fil_aio_wait(unsigned long)+0x138) [0x6f45f8]
innobackupex(io_handler_thread+0x28) [0x7215b8]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7efea879b182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7efea6fcb47d]
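
To check whether a failed join shows the same symptom as the trace above, one option is to search the joiner's xtrabackup log for the two telltale messages. This is only a sketch: the function name sst_log_shows_crash is hypothetical, and the log location varies by setup (the prepare phase often writes to innobackup.prepare.log in the data directory, but treat that path as an assumption).

```shell
# Return 0 if the given xtrabackup log shows an InnoDB assertion failure
# or a fatal signal, and print the matching lines for context.
# Returns non-zero (grep's exit status) when neither message is present.
sst_log_shows_crash() {
    log=$1
    grep -E 'InnoDB: Assertion failure|xtrabackup got signal' "$log"
}
```

For example, `sst_log_shows_crash /var/lib/mysql/innobackup.prepare.log` would print the assertion line and the "got signal 6" line for a log like the one in this report.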

summary: - Rejoining galera cluster fails
+ Rejoining/recovery of galera cluster fails
Changed in openstack-ansible:
importance: Undecided → High

Nobody who was active during this week's bug triage had the opportunity to test this bug. Therefore, all classification and work will be delayed until next week.

Sorry for the inconvenience.

Until xtrabackup releases a fix, we won't be able to fix this.
In the meantime, if you encounter this, you can either destroy the container or remove everything in /var/lib/mysql; an SST should then allow the node to join the cluster, as long as the cluster is in 'primary component' status.

Please also note we didn't reproduce the bug (yet).
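
The workaround described above can be sketched as two small shell helpers: one that checks the "primary component" condition, and one that empties the data directory only when that check passes. The function names are hypothetical. The sketch assumes the status text comes from running `mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'"` on a healthy node, and that the database service on the failed node has already been stopped.

```shell
# Read "wsrep_cluster_status<TAB>value" on stdin and succeed only if
# the value is Primary.
cluster_is_primary() {
    awk '$1 == "wsrep_cluster_status" { v = $2 } END { exit (v == "Primary" ? 0 : 1) }'
}

# Empty DATADIR so that the next start of the node triggers a full SST.
# STATUS_OUTPUT is the status text captured from a healthy node; the
# function refuses to delete anything unless it reports Primary.
wipe_datadir_for_sst() {
    datadir=$1
    status_output=$2
    if ! printf '%s\n' "$status_output" | cluster_is_primary; then
        echo "cluster is not Primary; leaving $datadir untouched" >&2
        return 1
    fi
    # ${datadir:?} aborts if the variable is empty, guarding against "rm -rf /*".
    # Note: the glob does not match dotfiles, so hidden files are kept.
    rm -rf "${datadir:?}"/*
}
```

The status check is kept separate from the deletion so it can be run on its own first. Since this removes the node's entire local copy of the data, it is worth confirming the target path before calling it.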

Darren Birkett (darren-birkett) wrote :

The linked bug with xtrabackup is actually not related to the bug encountered here. A more relevant bug may be:

https://bugs.launchpad.net/percona-xtrabackup/+bug/1250643

Is it possible to get more information on the versions of mariadb and xtrabackup in use at the time?
Has this been encountered multiple times in multiple environments, or was this a single occurrence?
Have we ever been able to replicate it?

Paul Halmos (paul-halmos) wrote :

Sorry for the delay. It has happened in multiple environments. This was one of them:

root@usw1infr001_galera_container-1a59697a:~# dpkg -l | grep mariadb
ii libmariadbclient-dev 10.0.25+maria-1~trusty amd64 MariaDB database development files
ii libmariadbclient18 10.0.25+maria-1~trusty amd64 MariaDB database client library
ii mariadb-client 10.0.25+maria-1~trusty all MariaDB database client (metapackage depending on the latest version)
ii mariadb-client-10.0 10.0.25+maria-1~trusty amd64 MariaDB database client binaries
ii mariadb-client-core-10.0 10.0.25+maria-1~trusty amd64 MariaDB database core client binaries
ii mariadb-common 10.0.25+maria-1~trusty all MariaDB database common files (e.g. /etc/mysql/conf.d/mariadb.cnf)
ii mariadb-galera-server-10.0 10.0.25+maria-1~trusty amd64 MariaDB database server with Galera cluster binaries

root@usw1infr001_galera_container-1a59697a:~# dpkg -l | grep xtrabackup
ii percona-xtrabackup-22 2.2.13-1.trusty amd64 Open source backup tool for InnoDB and XtraDB

eil397 (anton-haldin) wrote :

Hi Paul
I've spent some time trying to reproduce it, so far without success.

I was using a three-node galera cluster with the following scenario:
step one: take one node down
step two: generate a flow of data changes in the db (update/insert/delete)
step three: try to rejoin the node.

If you can provide any additional information, that would be great. For example, a rough estimate of how many changes happened (such as the number of new instances launched) would help.

Thanks.

Andrew Garner (abg) wrote :

This appears to be a manifestation of https://bugs.launchpad.net/percona-xtrabackup/+bug/1192834.

Just generating a flow of changes doesn't trip this bug - you have to generate undo (i.e. long-running transactions of some sort). See the test case in https://bugs.launchpad.net/percona-xtrabackup/+bug/1192834/comments/21.
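
As a rough illustration of what "generating undo" could look like (this is not the test case from the linked comment), the sketch below emits SQL that modifies rows inside a transaction and then keeps that transaction open for a while, so that uncommitted changes exist while an SST runs. The function name long_txn_sql, the table name, and the payload column are all placeholders.

```shell
# Emit SQL that opens a transaction, modifies rows, and then holds the
# transaction open for HOLD_SECONDS before rolling back, so that undo
# records exist for that period. Table and column names are placeholders.
long_txn_sql() {
    table=$1
    hold_seconds=$2
    cat <<EOF
START TRANSACTION;
UPDATE ${table} SET payload = CONCAT(payload, 'x');
SELECT SLEEP(${hold_seconds});
ROLLBACK;
EOF
}
```

Something like `long_txn_sql test.t1 600 | mysql` on a running node, started shortly before another node attempts to rejoin, would keep the transaction open during the transfer. Whether this is enough to trigger the crash is untested here.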

Disabling the xtrabackup compact option seems to be the best course of action.
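
To see whether a given node still has the compact option enabled, a config file can be scanned for it. The sketch below is an assumption about how the option might appear (either as a bare "compact" line in an option group, or as "--compact" inside an options string); this report does not show the actual template, and the function name find_compact_option is hypothetical.

```shell
# Print (with line numbers) every line in FILE that enables xtrabackup's
# compact mode, either as a bare "compact" option line or as "--compact"
# appearing before any "#" comment marker.
# Returns non-zero if no such line is found.
find_compact_option() {
    file=$1
    grep -nE '^[[:space:]]*compact[[:space:]]*$|^[^#]*--compact([^-[:alnum:]]|$)' "$file"
}
```

Running it against each MySQL config file on the node gives the file and line to edit; an empty result with a non-zero exit status means the option was not found in that file.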

Darren Birkett (darren-birkett) wrote :

Thanks Andrew, I will go ahead and implement the recommendation to remove the --compact option.

Changed in openstack-ansible:
assignee: nobody → Darren Birkett (darren-birkett)
Changed in openstack-ansible:
status: New → In Progress
Changed in openstack-ansible:
assignee: Darren Birkett (darren-birkett) → Jesse Pretorius (jesse-pretorius)

Reviewed: https://review.openstack.org/355451
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-galera_server/commit/?id=1f0cfa235acca67a953dfbff828d020da5dd734c
Submitter: Jenkins
Branch: master

commit 1f0cfa235acca67a953dfbff828d020da5dd734c
Author: Darren Birkett <email address hidden>
Date: Wed Aug 3 08:52:38 2016 +0100

    remove compact option from xtrabackup

    Using the --compact option with xtrabackup has been shown to cause
    crashes when used during an SST to transfer data between nodes:

    https://bugs.launchpad.net/percona-xtrabackup/+bug/1192834

    Based on DBA advice, we are disabling this option.

    Closes-Bug: #1590166
    Change-Id: I23fd5e36b74163fe97cf983cdc4b1d5678d94e7b

Changed in openstack-ansible:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/355685
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-galera_server/commit/?id=af8e44dcb24ba04a4e1bd0437f532fbe7ac5d904
Submitter: Jenkins
Branch: stable/mitaka

commit af8e44dcb24ba04a4e1bd0437f532fbe7ac5d904
Author: Darren Birkett <email address hidden>
Date: Wed Aug 3 08:52:38 2016 +0100

    remove compact option from xtrabackup

    Using the --compact option with xtrabackup has been shown to cause
    crashes when used during an SST to transfer data between nodes:

    https://bugs.launchpad.net/percona-xtrabackup/+bug/1192834

    Based on DBA advice, we are disabling this option.

    Closes-Bug: #1590166
    Change-Id: I23fd5e36b74163fe97cf983cdc4b1d5678d94e7b
    (cherry picked from commit 1f0cfa235acca67a953dfbff828d020da5dd734c)

Reviewed: https://review.openstack.org/355821
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=92ac36d5077e8fadcdf6f9944b1f40bedb64a495
Submitter: Jenkins
Branch: liberty

commit 92ac36d5077e8fadcdf6f9944b1f40bedb64a495
Author: Darren Birkett <email address hidden>
Date: Tue Aug 16 10:55:45 2016 +0100

    remove compact option from xtrabackup

    Using the --compact option with xtrabackup has been shown to cause
    crashes when used during an SST to transfer data between nodes:

    https://bugs.launchpad.net/percona-xtrabackup/+bug/1192834

    Based on DBA advice, we are disabling this option.

    Closes-Bug: #1590166
    Change-Id: I23fd5e36b74163fe97cf983cdc4b1d5678d94e7b

This issue was fixed in the openstack/openstack-ansible-galera_server 13.3.2 release.

This issue was fixed in the openstack/openstack-ansible 12.2.2 release.

This issue was fixed in the openstack/openstack-ansible-galera_server 14.0.0.0b3 development milestone.
