Galera cluster initialisation failures

Bug #1532761 reported by Jesse Pretorius
This bug affects 1 person
Affects            Status        Importance  Assigned to      Milestone
OpenStack-Ansible  Fix Released  Medium      Darren Birkett
Kilo               Fix Released  Medium      Jesse Pretorius
Liberty            Fix Released  Medium      Jesse Pretorius
Trunk              Fix Released  Medium      Darren Birkett

Bug Description

There is a fairly regular failure in the initialisation of Galera in the gate. On the restart of the first cluster node's service, it fails - for example:

2016-01-11 08:57:16.420 | NOTIFIED: [galera_server | Restart mysql] *************************************
2016-01-11 08:57:16.420 | failed: [aio1_galera_container-89fd4370] =>
2016-01-11 08:57:16.420 | msg: * Stopping MariaDB database server mysqld
2016-01-11 08:57:16.420 | ...done.
2016-01-11 08:57:16.420 | * Starting MariaDB database server mysqld
2016-01-11 08:57:16.420 | ...fail!

Logstash query: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22msg%3A%20The%20cluster%20may%20be%20broken%2C%20mysql%20is%20not%20running%20but%20appears%20to%20be%20installed.*%5C%22

Revision history for this message
Darren Birkett (darren-birkett) wrote :

Looking at logstash in openstack-infra, it seems that this issue only started happening in the last week (8 times) and is limited to HP Cloud instances. Given that HP Cloud is being wound down, I'm not convinced that this issue is worth spending time on (especially since it is provider-specific and difficult to reproduce).

I'd leave it open for a bit, just in case it starts happening more frequently or with other providers.

Changed in openstack-ansible:
assignee: nobody → Darren Birkett (darren-birkett)
status: New → Triaged
Revision history for this message
Jesse Pretorius (jesse-pretorius) wrote :

@Darren The message in later versions is a little different - this query covers both Kilo and Liberty and shows that this is not only HP Cloud: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22The%20cluster%20may%20be%20broken%5C%22

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (kilo)

Fix proposed to branch: kilo
Review: https://review.openstack.org/265910

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (liberty)

Fix proposed to branch: liberty
Review: https://review.openstack.org/265915

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (liberty)

Reviewed: https://review.openstack.org/265915
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=d839e2e4f950b08b6fb28169da16f56449797fb8
Submitter: Jenkins
Branch: liberty

commit d839e2e4f950b08b6fb28169da16f56449797fb8
Author: Jesse Pretorius <email address hidden>
Date: Mon Jan 11 16:36:49 2016 +0000

    Resolve MariaDB/Galera cluster startup/logging issues

    This patch ensures that MariaDB is given adequate time to start on a
    resource-constrained system (180s versus the default of 30s),
    ensures that the error log is appropriately populated and also
    provides a fallback restart in the case where there may be a corrupt
    sst directory.

    In the handler changes:
     - the environment variable "MYSQLD_STARTUP_TIMEOUT" is now being
       passed into the init script because the defaults are not being
       sourced at the init script runtime.
     - the temporary "sst" directory is cleaned up should the handler
       restart fail. This ensures that a node is in a clean state if a
       leftover sst directory was on the disk which would cause a node
       to fail to join a cluster or bootstrap.

    In the task changes a new configuration file, that is part of the
    mariadb package, is being removed which has unforeseen options within
    it causing no logs to be created.

    The default option "galera_innodb_additional_mem_pool_size" was removed
    because it's no longer valid within MariaDB10 and we'd never caught that
    error message until now.

    This patch is based on:
     - https://review.openstack.org/256016
     - https://review.openstack.org/266265

    Closes-Bug: #1532761
    Closes-Bug: #1533126
    Change-Id: I16af30c660790656fc2d59f9943c172b88098905
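
For readers following the handler changes described in the commit message above, the behaviour can be sketched roughly as below. This is a minimal Python illustration under stated assumptions, not the Ansible handler from the review: the `service mysql restart` command, the `/var/lib/mysql/.sst` path and the `restart_mysql` helper are hypothetical stand-ins. Only the `MYSQLD_STARTUP_TIMEOUT` variable and the 180s value come from the commit message.

```python
import os
import shutil
import subprocess
from pathlib import Path

# 180 seconds, versus the 30 second default mentioned in the commit message.
STARTUP_TIMEOUT = "180"


def restart_mysql(run=subprocess.run, sst_dir=Path("/var/lib/mysql/.sst")):
    """Restart mysql; if that fails, remove a leftover sst directory and retry once.

    The restart command and the default sst_dir are assumptions for
    illustration; adjust them to match the actual deployment.
    """
    # Pass the timeout explicitly, because the init script may not source
    # its defaults file at runtime.
    env = dict(os.environ, MYSQLD_STARTUP_TIMEOUT=STARTUP_TIMEOUT)
    cmd = ["service", "mysql", "restart"]  # assumed restart command

    if run(cmd, env=env).returncode == 0:
        return True

    # Fallback: a stale sst directory can stop a node from joining the
    # cluster or bootstrapping, so clear it before the second attempt.
    shutil.rmtree(sst_dir, ignore_errors=True)
    return run(cmd, env=env).returncode == 0
```

The `run` parameter exists only so the sketch can be exercised without a real MariaDB service; in normal use it would be left at its default.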

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (kilo)

Fix proposed to branch: kilo
Review: https://review.openstack.org/268975

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on openstack-ansible (kilo)

Change abandoned by Jesse Pretorius (<email address hidden>) on branch: kilo
Review: https://review.openstack.org/265910
Reason: This is included in https://review.openstack.org/268975

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (kilo)

Reviewed: https://review.openstack.org/268975
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=4a401125e46ca28d3d8848ea194737f1e17f6992
Submitter: Jenkins
Branch: kilo

commit 4a401125e46ca28d3d8848ea194737f1e17f6992
Author: Jesse Pretorius <email address hidden>
Date: Mon Jan 11 16:24:38 2016 +0000

    Resolve MariaDB/Galera cluster startup/logging issues

    This patch ensures that MariaDB is given adequate time to start on a
    resource-constrained system (180s versus the default of 30s),
    ensures that the error log is appropriately populated and also
    provides a fallback restart in the case where there may be a corrupt
    sst directory.

    In the handler changes:
     - the environment variable "MYSQLD_STARTUP_TIMEOUT" is now being
       passed into the init script because the defaults are not being
       sourced at the init script runtime.
     - the temporary "sst" directory is cleaned up should the handler
       restart fail. This ensures that a node is in a clean state if a
       leftover sst directory was on the disk which would cause a node
       to fail to join a cluster or bootstrap.

    In the task changes:
     - a new configuration file (part of the mariadb package) is being
       removed which has unforeseen options within it causing no logs
       to be created.
     - a mysql ping check is implemented to verify that the service is
       responding after the restart handler is fired.

    This patch is based on:
     - https://review.openstack.org/256016
     - https://review.openstack.org/266265
     - https://review.openstack.org/268707

    Closes-Bug: #1532761
    Closes-Bug: #1533126
    Change-Id: I16af30c660790656fc2d59f9943c172b88098905

    Wait for galera to respond after restarts

    Add a mysql ping check to verify the service is responding
    after a restart handler is fired.

    Change-Id: Idfc1e1a1113ab0ffa221e4c0a4cc074df23fe89a
    (cherry picked from commit f6fb63f3477e7cdada1a1be8d670755a0e4e6f0b)
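
The "mysql ping check" mentioned in the commit message above could look something like the following. This is a hedged sketch rather than the task from the review: the `wait_for_mysql` helper, the retry count and the delay are hypothetical, and it assumes `mysqladmin` is on the PATH with working credentials. It relies on `mysqladmin ping` exiting with status 0 when the server is available.

```python
import subprocess
import time


def mysqladmin_ping():
    """Return True if `mysqladmin ping` reports that the server is available."""
    result = subprocess.run(
        ["mysqladmin", "ping"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_mysql(ping=mysqladmin_ping, retries=6, delay=10.0, sleep=time.sleep):
    """Poll the server until it responds or the retries are exhausted.

    retries and delay are illustrative values, not taken from the patch.
    """
    for attempt in range(retries):
        if ping():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < retries - 1:
            sleep(delay)
    return False
```

Calling `wait_for_mysql()` straight after a restart would hold up later tasks until the service answers, which is the effect the commit message describes.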

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/openstack-ansible 11.2.11

This issue was fixed in the openstack/openstack-ansible 11.2.11 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/openstack-ansible 12.0.8

This issue was fixed in the openstack/openstack-ansible 12.0.8 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/openstack-ansible 11.2.12

This issue was fixed in the openstack/openstack-ansible 11.2.12 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/openstack-ansible 12.0.9

This issue was fixed in the openstack/openstack-ansible 12.0.9 release.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/openstack-ansible 12.0.11

This issue was fixed in the openstack/openstack-ansible 12.0.11 release.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/openstack-ansible 11.2.14

This issue was fixed in the openstack/openstack-ansible 11.2.14 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/openstack-ansible 11.2.15

This issue was fixed in the openstack/openstack-ansible 11.2.15 release.
