OpenStack-Ansible

Galera cluster initialisation failures

Bug #1532761 reported by Jesse Pretorius on 2016-01-11

Affects	Status	Importance	Assigned to	Milestone
OpenStack-Ansible	Fix Released	Medium	Darren Birkett	OpenStack-Ansible mitaka-2
Kilo	Fix Released	Medium	Jesse Pretorius	OpenStack-Ansible 11.2.7
Liberty	Fix Released	Medium	Jesse Pretorius	OpenStack-Ansible 12.0.4
Trunk	Fix Released	Medium	Darren Birkett	OpenStack-Ansible mitaka-2

Bug Description

Galera initialisation fails fairly regularly in the gate. The restart of the MySQL service on the first cluster node fails, for example:

2016-01-11 08:57:16.420 | NOTIFIED: [galera_server | Restart mysql] *************************************
2016-01-11 08:57:16.420 | failed: [aio1_galera_container-89fd4370] =>
2016-01-11 08:57:16.420 | msg: * Stopping MariaDB database server mysqld
2016-01-11 08:57:16.420 | ...done.
2016-01-11 08:57:16.420 | * Starting MariaDB database server mysqld
2016-01-11 08:57:16.420 | ...fail!

Logstash query: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22msg%3A%20The%20cluster%20may%20be%20broken%2C%20mysql%20is%20not%20running%20but%20appears%20to%20be%20installed.*%5C%22

Darren Birkett (darren-birkett) wrote on 2016-01-11:

Looking at logstash in openstack-infra, it seems that this issue only started happening in the last week (8 times) and is limited to HP Cloud instances. Given that HP Cloud is being wound down, I'm not convinced that this issue is worth spending time on, especially since it is provider-specific and difficult to reproduce.

I'd leave it open for a bit, just in case it starts happening more frequently or with other providers.

Changed in openstack-ansible:
assignee:	nobody → Darren Birkett (darren-birkett)
status:	New → Triaged

Jesse Pretorius (jesse-pretorius) wrote on 2016-01-11:

@Darren The message in later versions is a little different - this query covers both Kilo and Liberty and shows that this is not only HP Cloud: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22The%20cluster%20may%20be%20broken%5C%22

Jesse Pretorius (jesse-pretorius) wrote on 2016-01-11:

Master patch merged: https://review.openstack.org/256016

OpenStack Infra (hudson-openstack) wrote on 2016-01-11: Fix proposed to openstack-ansible (kilo)

Fix proposed to branch: kilo
Review: https://review.openstack.org/265910

OpenStack Infra (hudson-openstack) wrote on 2016-01-11: Fix proposed to openstack-ansible (liberty)

Fix proposed to branch: liberty
Review: https://review.openstack.org/265915

OpenStack Infra (hudson-openstack) wrote on 2016-01-14: Fix merged to openstack-ansible (liberty)

Reviewed: https://review.openstack.org/265915
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=d839e2e4f950b08b6fb28169da16f56449797fb8
Submitter: Jenkins
Branch: liberty

commit d839e2e4f950b08b6fb28169da16f56449797fb8
Author: Jesse Pretorius <email address hidden>
Date: Mon Jan 11 16:36:49 2016 +0000

Resolve MariaDB/Galera cluster startup/logging issues

    This patch ensures that MariaDB is given adequate time to start on a
    resource-constrained system (180s versus the default of 30s),
    ensures that the error log is appropriately populated and also
    provides a fallback restart in the case where there may be a corrupt
    sst directory.

    In the handler changes:
     - the environment variable "MYSQLD_STARTUP_TIMEOUT" is now being
       passed into the init script because the defaults are not being
       sourced at the init script runtime.
     - the temporary "sst" directory is cleaned up should the handler
       restart fail. This ensures that a node is in a clean state if a
       leftover sst directory was on the disk which would cause a node
       to fail to join a cluster or bootstrap.

    In the task changes a new configuration file, that is part of the
    mariadb package, is being removed which has unforeseen options within
    it causing no logs to be created.

    The default option "galera_innodb_additional_mem_pool_size" was removed
    because it's no longer valid within MariaDB10 and we'd never caught that
    error message until now.

    This patch is based on:
     - https://review.openstack.org/256016
     - https://review.openstack.org/266265

    Closes-Bug: #1532761
    Closes-Bug: #1533126
    Change-Id: I16af30c660790656fc2d59f9943c172b88098905
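
The handler behaviour described in this commit message can be sketched roughly as follows. This is only an illustration in Python, not the Ansible handler from the patch: the init script path, the function names and the injectable `run` callable are assumptions, and the location of the sst directory is left as a parameter because the report does not state it. Only the variable name MYSQLD_STARTUP_TIMEOUT and the 180s value come from the commit message.

```python
import os
import shutil
import subprocess
from typing import Callable, Mapping, Sequence

# Hypothetical default: the bug report does not give the init script path.
RESTART_CMD = ("/etc/init.d/mysql", "restart")
STARTUP_TIMEOUT_SECONDS = 180  # value quoted in the commit message


def run_command(cmd: Sequence[str], env: Mapping[str, str]) -> bool:
    """Run cmd with the given environment; True means exit status 0."""
    return subprocess.run(list(cmd), env=dict(env)).returncode == 0


def restart_with_fallback(
    sst_dir: str,
    run: Callable[[Sequence[str], Mapping[str, str]], bool] = run_command,
    cmd: Sequence[str] = RESTART_CMD,
) -> bool:
    """Restart MariaDB; if that fails, remove sst_dir and retry once."""
    # Pass the timeout explicitly in the environment, because (per the
    # commit message) the init script does not source its defaults.
    env = dict(os.environ, MYSQLD_STARTUP_TIMEOUT=str(STARTUP_TIMEOUT_SECONDS))
    if run(cmd, env):
        return True
    # A leftover sst directory can stop a node from joining or bootstrapping,
    # so clean it up before the second and final attempt.
    shutil.rmtree(sst_dir, ignore_errors=True)
    return run(cmd, env)
```

Keeping the command runner injectable means the fallback logic can be exercised without a real MariaDB installation.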

OpenStack Infra (hudson-openstack) wrote on 2016-01-18: Fix proposed to openstack-ansible (kilo)

Fix proposed to branch: kilo
Review: https://review.openstack.org/268975

OpenStack Infra (hudson-openstack) wrote on 2016-01-18: Change abandoned on openstack-ansible (kilo)

Change abandoned by Jesse Pretorius (<email address hidden>) on branch: kilo
Review: https://review.openstack.org/265910
Reason: This is included in https://review.openstack.org/268975

OpenStack Infra (hudson-openstack) wrote on 2016-01-18: Fix merged to openstack-ansible (kilo)


Reviewed:  https://review.openstack.org/268975
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=4a401125e46ca28d3d8848ea194737f1e17f6992
Submitter: Jenkins
Branch:    kilo

commit 4a401125e46ca28d3d8848ea194737f1e17f6992
Author: Jesse Pretorius <jesse.pretorius@rackspace.co.uk>
Date:   Mon Jan 11 16:24:38 2016 +0000

Resolve MariaDB/Galera cluster startup/logging issues
    
    This patch ensures that MariaDB is given adequate time to start on a
    resource-constrained system (180s versus the default of 30s),
    ensures that the error log is appropriately populated and also
    provides a fallback restart in the case where there may be a corrupt
    sst directory.
    
    In the handler changes:
     - the environment variable "MYSQLD_STARTUP_TIMEOUT" is now being
       passed into the init script because the defaults are not being
       sourced at the init script runtime.
     - the temporary "sst" directory is cleaned up should the handler
       restart fail. This ensures that a node is in a clean state if a
       leftover sst directory was on the disk which would cause a node
       to fail to join a cluster or bootstrap.
    
    In the task changes:
     - a new configuration file (part of the mariadb package) is being
       removed which has unforeseen options within it causing no logs
       to be created.
     - a mysql ping check is implemented to verify that the service is
       responding after the restart handler is fired.
    
    This patch is based on:
     - https://review.openstack.org/256016
     - https://review.openstack.org/266265
     - https://review.openstack.org/268707
    
    Closes-Bug: #1532761
    Closes-Bug: #1533126
    Change-Id: I16af30c660790656fc2d59f9943c172b88098905
    
    Wait for galera to respond after restarts
    
    Add a mysql ping check to verify the service is responding
    after a restart handler is fired.
    
    Change-Id: Idfc1e1a1113ab0ffa221e4c0a4cc074df23fe89a
    (cherry picked from commit f6fb63f3477e7cdada1a1be8d670755a0e4e6f0b)
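
The "mysql ping check" mentioned in this commit message amounts to polling the server until it answers. A minimal sketch in Python, assuming `mysqladmin ping` as the probe (it exits 0 when the server is running); the function names, retry count and delay are hypothetical and are not taken from the patch:

```python
import subprocess
import time
from typing import Callable


def mysql_ping() -> bool:
    """Return True when `mysqladmin ping` exits 0 (server is responding)."""
    result = subprocess.run(
        ["mysqladmin", "ping"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_mysql(
    ping: Callable[[], bool] = mysql_ping,
    retries: int = 6,      # hypothetical value
    delay: float = 5.0,    # hypothetical value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll ping() up to `retries` times, sleeping `delay` between attempts."""
    for attempt in range(retries):
        if ping():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < retries - 1:
            sleep(delay)
    return False
```

Running such a check after the restart handler lets a play fail at the point where the service is actually unresponsive, rather than in a later task that happens to need the database.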

Davanum Srinivas (DIMS) (dims-v) wrote on 2016-03-18: Fix included in openstack/openstack-ansible 11.2.11

This issue was fixed in the openstack/openstack-ansible 11.2.11 release.

Doug Hellmann (doug-hellmann) wrote on 2016-03-18: Fix included in openstack/openstack-ansible 12.0.8

This issue was fixed in the openstack/openstack-ansible 12.0.8 release.

Doug Hellmann (doug-hellmann) wrote on 2016-03-25: Fix included in openstack/openstack-ansible 11.2.12

This issue was fixed in the openstack/openstack-ansible 11.2.12 release.

Doug Hellmann (doug-hellmann) wrote on 2016-03-30: Fix included in openstack/openstack-ansible 12.0.9

This issue was fixed in the openstack/openstack-ansible 12.0.9 release.

Davanum Srinivas (DIMS) (dims-v) wrote on 2016-05-03: Fix included in openstack/openstack-ansible 12.0.11

This issue was fixed in the openstack/openstack-ansible 12.0.11 release.

Davanum Srinivas (DIMS) (dims-v) wrote on 2016-05-03: Fix included in openstack/openstack-ansible 11.2.14

This issue was fixed in the openstack/openstack-ansible 11.2.14 release.

Doug Hellmann (doug-hellmann) wrote on 2016-05-05: Fix included in openstack/openstack-ansible 11.2.15

This issue was fixed in the openstack/openstack-ansible 11.2.15 release.
