Secondary nodes fail to join galera cluster

Bug #1569446 reported by Jimmy McCrory
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack-Ansible
Fix Released
Critical
Darren Birkett
Liberty
Fix Released
Critical
Darren Birkett
Mitaka
Fix Released
Critical
Darren Birkett
Trunk
Fix Released
Critical
Darren Birkett

Bug Description

So far I've only seen this while running the galera_server test playbooks and even then it's sporadic.

The initial state transfer from the primary node seems to fail, preventing mysql on secondary nodes from starting and joining the cluster.

http://logs.openstack.org/90/303090/19/check/gate-openstack-ansible-galera_server-ansible-func-ubuntu-trusty/a6ff46b/console.html#_2016-04-10_07_10_41_547

I was able to reproduce it locally and captured logs.
container2
==========
/var/log/mysql_logs/galera_server_error.log
http://paste.openstack.org/show/493594/

container1
==========
/var/log/mysql_logs/galera_server_error.log
http://paste.openstack.org/show/493595/

/var/lib/mysql/innobackup.backup.log
http://paste.openstack.org/show/493596/

Changed in openstack-ansible:
status: New → Confirmed
milestone: none → newton-1
importance: Undecided → Critical
Changed in openstack-ansible:
assignee: nobody → Darren Birkett (darren-birkett)
Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

Marking this as critical as it's directly affecting patches merging in the galera_server role, and also has an effect on any production environments from Liberty onwards (at least).

Revision history for this message
Jimmy McCrory (jimmy-mccrory) wrote :

Currently xtrabackup 2.2.13 is being installed.
The most recent stable release is 2.4.2.

The only really interesting thing I've noticed so far in change logs between versions was the innobackupex script being rewritten in C (2.3.1-beta1).

https://www.percona.com/doc/percona-xtrabackup/2.4/release-notes.html

Still need testing to see whether using a more recent package has any effect on this issue and what else would possibly need to change in configs or when upgrading.

Revision history for this message
Darren Birkett (darren-birkett) wrote :

My current working theory is that for some reason, on the nodes that fail to join, mysqld_safe is kicking in and trying to start another mysqld daemon which is messing with the in-progress SST:

https://gist.github.com/mancdaz/61f5529e77f2c1fce7964c0bbaa6d14a

Revision history for this message
Darren Birkett (darren-birkett) wrote :

tested xtrabackup 2.3.4 and am seeing the same problem.

N.B in testing I have removed the upgrade play to just focus on the initial cluster build issues. Also, I have set the number of retries on the mysql restart handler to 1, and removed any secondary handlers from being called. There is a different, separate, issue that you will see on subsequent attempts if the first one fails, and so I did this to isolate the issue to the initial failure. Changes I made can be found in this fork:

https://github.com/mancdaz/openstack-ansible-galera_server/tree/1569446

Revision history for this message
Jimmy McCrory (jimmy-mccrory) wrote :

Thanks for that, definitely simplifies testing and good find with your recent logs.

I haven't been able to reproduce yet today on the couple machines I have for testing, but will keep rebuilding some more tonight.

Revision history for this message
Darren Birkett (darren-birkett) wrote :

So some more things discovered:

The default value of MYSQLD_STARTUP_TIMEOUT is 30

the defaults file that we drop into /etc/mysql/default is essentially ignored, at least in terms of obeying MYSQLD_STARTUP_TIMEOUT=1800

The environment variable that we set in the test playbook (MYSQLD_STARTUP_TIMEOUT=180) is also ignored by the mysql initscript

A successful startup, when doing an SST, can take around 32 seconds.

It's possible that the startup is timing out before the SST is completed. The randomness of success could be explained by some startups taking a bit less than 30 seconds.

To test - work out how we successfully set the MYSQLD_STARTUP_TIMEOUT to a high value, and see if we eliminate the error. The only way I've found to do this right now is to manually edit the initscript (/etc/init.d/mysql)

Revision history for this message
Jean-Philippe Evrard (jean-philippe-evrard) wrote :

FYI, I took Darren repo as a source, but replaced the xtrabackup version to the current role's one.

Couldn't reproduce the issue right now.

Revision history for this message
Darren Birkett (darren-birkett) wrote :

So, to test the above theory I added a lineinfile to edit the timeout directly in the initscript, rather than relying on the failing defaults file or environment variable. As yet in 3 runs, I have not seen the join issue :) I'd appreciate more testing of this theory...updated fork can be found here:

https://github.com/mancdaz/openstack-ansible-galera_server/commit/52f4fe9f2b7e6e0683887e86644fe610b4ec4a57#diff-f49431d5bf5e7b643437e76ade9b3e35R58

The question now is how to address this:

- patch the initscript to source the defaults file
- patch the initscript to set the timeout higher (as above)
- carry the entire fixed initscript

Revision history for this message
Darren Birkett (darren-birkett) wrote :

OK I'm pretty sure this fixes the issue. I just reproduced my success another 6 times :)

Revision history for this message
Jimmy McCrory (jimmy-mccrory) wrote :

Looks like that's the issue, nice job.

I tried overwriting the init script on secondary nodes with MYSQLD_STARTUP_TIMEOUT hardcoded to 10 and got a failure every time.

The environment being ignored is being caused by the /usr/sbin/service command.

It'll look for an init script in /etc/init/ first, otherwise it goes to /etc/init.d/ (where mysql's is) while completely ignoring existing environment variables...

# Otherwise, use the traditional sysvinit
if [ -x "${SERVICEDIR}/${SERVICE}" ]; then
   exec env -i LANG="$LANG" PATH="$PATH" TERM="$TERM" "$SERVICEDIR/$SERVICE" ${ACTION} ${OPTIONS}

The mysql init script also doesn't seem to source /etc/default/mysql at any point, where we're also setting MYSQLD_STARTUP_TIMEOUT.

Revision history for this message
Jimmy McCrory (jimmy-mccrory) wrote :

I upgraded mariadb to 10.1.13 out of curiosity.

It still drops its init script to /etc/init.d/mysql, but that script now includes a source of /etc/default/mariadb.

Think a lineinfile edit may be the best option for now, but just something to keep in mind whenever we do upgrade the package itself again.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible-galera_server (master)

Fix proposed to branch: master
Review: https://review.openstack.org/306298

Changed in openstack-ansible:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible-galera_server (master)

Reviewed: https://review.openstack.org/306298
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-galera_server/commit/?id=b9a7361b28a7acb2dd05ef960cf474dc91486a6a
Submitter: Jenkins
Branch: master

commit b9a7361b28a7acb2dd05ef960cf474dc91486a6a
Author: Darren Birkett <email address hidden>
Date: Fri Apr 15 09:12:32 2016 +0100

    ensure mysql defaults file is sourced

    The mysql initscript that comes with mariadb-10.0 does not source the
    /etc/default/mysql file, meaning that the MYSQLD_STARTUP_TIMEOUT that
    we set in there is being ignored. This has the effect of causing some
    nodes to timeout their startup too early, interrupting the initial SST,
    and causing the cluster join to fail.

    This commit adds a line to the initscript to ensure that we are
    sourcing /etc/default/mysql correctly.

    Closes-Bug: #1569446
    Co-Authored-By: Jimmy McCrory <email address hidden>

    Change-Id: I6af1cc660c2bbc2a49707cfab995659dc0167f2f

Changed in openstack-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible-galera_server (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/306440

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (liberty)

Fix proposed to branch: liberty
Review: https://review.openstack.org/306450

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible-galera_server (stable/mitaka)

Reviewed: https://review.openstack.org/306440
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-galera_server/commit/?id=a743ea9689b7de12539d42eef63f90d4bfa5bc1c
Submitter: Jenkins
Branch: stable/mitaka

commit a743ea9689b7de12539d42eef63f90d4bfa5bc1c
Author: Darren Birkett <email address hidden>
Date: Fri Apr 15 09:12:32 2016 +0100

    ensure mysql defaults file is sourced

    The mysql initscript that comes with mariadb-10.0 does not source the
    /etc/default/mysql file, meaning that the MYSQLD_STARTUP_TIMEOUT that
    we set in there is being ignored. This has the effect of causing some
    nodes to timeout their startup too early, interrupting the initial SST,
    and causing the cluster join to fail.

    This commit adds a line to the initscript to ensure that we are
    sourcing /etc/default/mysql correctly.

    Closes-Bug: #1569446
    Co-Authored-By: Jimmy McCrory <email address hidden>

    Change-Id: I6af1cc660c2bbc2a49707cfab995659dc0167f2f
    (cherry picked from commit b9a7361b28a7acb2dd05ef960cf474dc91486a6a)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (liberty)

Reviewed: https://review.openstack.org/306450
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=a71abb89e892a57553bfb3d147f7543a3fbe6c92
Submitter: Jenkins
Branch: liberty

commit a71abb89e892a57553bfb3d147f7543a3fbe6c92
Author: Darren Birkett <email address hidden>
Date: Fri Apr 15 14:59:56 2016 +0100

    ensure mysql defaults file is sourced

    The mysql initscript that comes with mariadb-10.0 does not source the
    /etc/default/mysql file, meaning that the MYSQLD_STARTUP_TIMEOUT that
    we set in there is being ignored. This has the effect of causing some
    nodes to timeout their startup too early, interrupting the initial SST,
    and causing the cluster join to fail.

    This commit adds a line to the initscript to ensure that we are
    sourcing /etc/default/mysql correctly.

    Closes-Bug: #1569446
    Co-Authored-By: Jimmy McCrory <email address hidden>

    Change-Id: I6af1cc660c2bbc2a49707cfab995659dc0167f2f

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/openstack-ansible-galera_server 13.0.1

This issue was fixed in the openstack/openstack-ansible-galera_server 13.0.1 release.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

This issue was fixed in the openstack/openstack-ansible-galera_server 13.0.1 release.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/openstack-ansible 12.0.11

This issue was fixed in the openstack/openstack-ansible 12.0.11 release.

Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/openstack-ansible-galera_server 14.0.0.0b1

This issue was fixed in the openstack/openstack-ansible-galera_server 14.0.0.0b1 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.