non-leaders fail deployment with differing bootstrap uuids

Bug #1738896 reported by Corey Bryant
This bug affects 4 people
Affects: OpenStack Percona Cluster Charm
Status: Fix Released
Importance: Medium
Assigned to: David Ames

Bug Description

When testing Percona 5.7 packaging on Bionic with the master branch of this charm, I regularly hit "bootstrap uuid differs" failures on non-leader units. These occurred during the leader-settings-changed hook. leader_settings_changed() calls update_bootstrap_uuid(), where the InconsistentUUIDError exception is raised.
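
A minimal sketch of the failing check, for readers who don't have the charm open. It is not the charm's code: the two arguments are plain values standing in for the leader's 'bootstrap-uuid' and the UUID the local mysqld reports (both assumptions), and only the exception names come from this report.

```python
class LeaderNoBootstrapUUIDError(Exception):
    """The leader has not published a bootstrap-uuid yet."""


class InconsistentUUIDError(Exception):
    """The local cluster state UUID differs from the leader's."""


def update_bootstrap_uuid(leader_uuid, local_uuid):
    """Compare the leader's UUID with this unit's cluster state UUID.

    Both arguments are hypothetical stand-ins; in the charm they would come
    from leader settings and from the running mysqld respectively.
    """
    if not leader_uuid:
        raise LeaderNoBootstrapUUIDError()
    if local_uuid and local_uuid != leader_uuid:
        # A unit that bootstrapped on its own ends up here.
        raise InconsistentUUIDError(
            "leader {} != local {}".format(leader_uuid, local_uuid))
    return leader_uuid
```

A non-leader that started mysqld with its own freshly generated UUID takes the InconsistentUUIDError branch.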

Looking through the charm code, non-leaders were getting False returned from the is_bootstrapped() call in config_changed(). Because of this, these units weren't rendering config, so they ran with the default version of /etc/mysql/percona-xtradb-cluster.conf.d/mysqld.cnf. A templated mysqld.cnf must be rendered for clustering to succeed (i.e. wsrep_sst_auth, wsrep_cluster_name, wsrep_cluster_address and more need to be set correctly). This was happening because the non-leaders had no 'bootstrap-uuid' set over their cluster relations.

Looking at the leader unit, it does set bootstrap-uuid via relation_set() for all relation IDs as well as via leader_set() in notify_bootstrapped(). Also worth noting, leader_set() is called last.

It appears that non-leaders get into the leader-settings-changed hook too early, before configs are rendered.
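
To make the ordering problem concrete, here is a rough model of the behaviour described above. The dict stands in for peer relation data, and the body of is_bootstrapped() is an assumption about what a relation-based check looks like, not a copy of the charm.

```python
def is_bootstrapped(peer_relation_data):
    """True only if every known peer has published the same bootstrap-uuid.

    peer_relation_data maps unit name -> relation settings dict
    (a hypothetical stand-in for the cluster relation).
    """
    uuids = [data.get("bootstrap-uuid") for data in peer_relation_data.values()]
    return bool(uuids) and all(uuids) and len(set(uuids)) == 1


def config_changed(peer_relation_data):
    """Return which mysqld.cnf the unit ends up running with."""
    if not is_bootstrapped(peer_relation_data):
        # Config is not rendered, so the packaged default stays in place.
        return "default"
    return "templated"
```

If leader-settings-changed fires while the relation data is still empty, the unit is still on the "default" path when update_bootstrap_uuid() runs.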

Changed in charm-percona-cluster:
assignee: nobody → Corey Bryant (corey.bryant)
status: New → Triaged
importance: Undecided → Medium
description: updated
Corey Bryant (corey.bryant) wrote :

One thought for fixing this was to add an except for InconsistentUUIDError in leader_settings_changed() to handle the exception from update_bootstrap_uuid(). For example:

try:
    update_bootstrap_uuid()
except LeaderNoBootstrapUUIDError:
    status_set('waiting', "Waiting for bootstrap-uuid set by leader")
except InconsistentUUIDError:
    status_set('waiting', "config-changed hook likely hasn't rendered config yet")

The problem with this approach is that the leader-settings-changed hook might not fire again, and update_bootstrap_uuid() may not successfully get called.

Corey Bryant (corey.bryant) wrote :

Another thought is to delay the leader_set('bootstrap-uuid') that is currently called in notify_bootstrapped() (called in config-changed hook), since the issue seems to be the leader-settings-changed hook is firing earlier than it did in the past and too early for non-leaders.

If we can do this, then is_bootstrapped() should be successful on all non-leaders, allowing them to render their config prior to leader-settings-changed hook firing. The config-changed hook for non-leaders would need to be called after the leader calls relation_set('bootstrap-uuid') and before the leader calls leader_set('bootstrap-uuid'). Then, when leader-settings-changed hook is fired, update_bootstrap_uuid() should be successful for non-leaders and they'd call relation_set('bootstrap-uuid') with the leader's uuid.

Could we leader_set('bootstrap-uuid') in the cluster-relation-changed hook? This may need a new relation key for non-leaders to tell the leader that they've rendered their config.
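
A sketch of what that handshake could look like, using plain dicts in place of relation data and leader settings. The 'config-rendered' key and both function names are hypothetical; nothing like them exists in the charm today.

```python
def all_peers_rendered(peer_relation_data):
    """True when every peer has set the hypothetical 'config-rendered' key."""
    return bool(peer_relation_data) and all(
        data.get("config-rendered") == "true"
        for data in peer_relation_data.values())


def cluster_relation_changed(peer_relation_data, leader_settings, bootstrap_uuid):
    """Leader side: publish bootstrap-uuid only once peers report rendering.

    leader_settings is a dict standing in for leader_set()/leader_get().
    """
    if all_peers_rendered(peer_relation_data):
        leader_settings["bootstrap-uuid"] = bootstrap_uuid
    return leader_settings
```

The cost is an extra round trip over the cluster relation before leader-settings-changed can fire with a UUID.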

David Ames (thedac)
Changed in charm-percona-cluster:
assignee: Corey Bryant (corey.bryant) → nobody
assignee: nobody → David Ames (thedac)
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (master)

Reviewed: https://review.openstack.org/531039
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=a3b43e16ece9da72b53fa8a267923b29c5573021
Submitter: Zuul
Branch: master

commit a3b43e16ece9da72b53fa8a267923b29c5573021
Author: David Ames <email address hidden>
Date: Wed Jan 3 16:05:46 2018 -0800

    Fix differing bootstrap uuids

    When is_bootstrapped is consulted it was returning a false negative.
    When the leader node is bootstrapped it sets the uuid via leader_set,
    but during a leader-settings-changed hook when the non-leader should
    be picking up the uuid it was checking relation data instead which will
    be way behind the curve. This is a vestigial block of code
    pre-leadership.

    Leader settings get set much earlier than relation data. This change
    consults leader_get rather than the relation. We also make sure the
    mysqld.cnf file is rendered on a leader-settings-changed hook.

    Change-Id: I95e56bd28152c934f413025a22dd6821b2ad8e94
    Closes-Bug: #1738896
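
As a rough illustration of the approach in this commit message (not the committed diff), the check can be modelled against leader settings instead of relation data. The dict argument and the render_config callback are hypothetical stand-ins.

```python
def is_bootstrapped(leader_settings):
    """Consult leader settings rather than peer relation data."""
    return bool(leader_settings.get("bootstrap-uuid"))


def leader_settings_changed(leader_settings, render_config):
    """Render mysqld.cnf as soon as the leader has published a UUID.

    render_config is a callable supplied by the caller (assumption).
    Returns True if the config was rendered in this hook.
    """
    if not is_bootstrapped(leader_settings):
        return False
    render_config()
    return True
```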

Changed in charm-percona-cluster:
status: Triaged → Fix Committed
Corey Bryant (corey.bryant) wrote :

Moving back to Triaged since the bug is still surfacing.

Changed in charm-percona-cluster:
status: Fix Committed → Triaged
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-percona-cluster (master)

Fix proposed to branch: master
Review: https://review.openstack.org/537020

Changed in charm-percona-cluster:
status: Triaged → In Progress
David Ames (thedac) wrote :

Previous attempts to solve Bug #1738896 missed the root cause. The
root cause problem is when the configuration file is rendered before
percona is installed. The rendering includes clustering configuration
which causes percona-cluster to automatically do a single bootstrap
when percona-cluster packages are installed leading to the UUID
mismatch.

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (master)

Reviewed: https://review.openstack.org/537020
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=dc19ecb4a3e36999e0dda98a4ff040c8509152a2
Submitter: Zuul
Branch: master

commit dc19ecb4a3e36999e0dda98a4ff040c8509152a2
Author: David Ames <email address hidden>
Date: Tue Jan 23 18:01:21 2018 +0000

    Guarantee timing of installation and render

    Previous attempts to solve Bug #1738896 missed the root cause. The
    root cause problem is when the configuration file is rendered before
    percona is installed. The rendering includes clustering configuration
    which causes percona-cluster to automatically do a single bootstrap
    when percona-cluster packages are installed leading to the UUID
    mismatch.

    The timing and ordering of installation, rendering of the
    configuration and restart of mysql is critical across all hook
    executions.

    This change is a partial reversion of Change ID
    I95e56bd28152c934f413025a22dd6821b2ad8e94. The change primarily
    guarantees percona-cluster is not installed on non-leader nodes
    before the leader is bootstrapped and makes sure the configuration
    does not get rendered prior to installation.

    is_leader_bootstrapped is introduced and guarantees all data expected
    from the leader is available to guard on various tasks.

    Closes-Bug: #1744961
    Closes-Bug: #1738896
    Change-Id: Ifeb1520dba3b14fc1b51a586141905a385f2b2c1
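
The guard and the ordering described in this commit message can be sketched as follows. The key list and run_hook() are illustrative assumptions; only the name is_leader_bootstrapped comes from the commit.

```python
# Hypothetical list of settings a non-leader needs from the leader.
REQUIRED_LEADER_KEYS = ("bootstrap-uuid", "root-password", "sst-password")


def is_leader_bootstrapped(leader_settings):
    """True once every expected leader setting is available."""
    return all(
        leader_settings.get(key) is not None for key in REQUIRED_LEADER_KEYS)


def run_hook(is_leader, leader_settings, installed):
    """Return the ordered actions one hook execution would take."""
    if not is_leader and not is_leader_bootstrapped(leader_settings):
        return []  # non-leader waits; nothing is installed or rendered
    actions = []
    if not installed:
        actions.append("install")
    # Rendering always comes after installation, restart after rendering.
    actions.extend(["render", "restart"])
    return actions
```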

Changed in charm-percona-cluster:
status: In Progress → Fix Committed
Jason Hobbs (jason-hobbs) wrote :

We hit this with bd5474c, which is tip of charm-percona-cluster right now, so it appears to still not be fixed.

Changed in charm-percona-cluster:
status: Fix Committed → New
Ashley Lai (alai) wrote :

We just hit this again. I've attached the crashdump for investigation.

David Ames (thedac) wrote :

Root Cause

After many incorrect theories (and implementations of said theories) about the cause of the mismatched UUIDs, it comes down to this:

If the mysqld.cnf has the following rendered, on start up percona-cluster >= 5.6 will generate a UUID assuming it is a new cluster.

 wsrep_cluster_address=gcomm://

If a non-leader node has this configuration and starts (or restarts), it will have an independent UUID and it will conflict with the leader's UUID.

Our mysqld.cnf template was not helping. When clustered=False it would configure the auto-bootstrapping configuration.
https://github.com/openstack/charm-percona-cluster/blob/master/templates/mysqld.cnf#L110

 {% if not clustered -%}
 # Empty gcomm address is being used when cluster is getting bootstrapped
 wsrep_cluster_address=gcomm://
 {% else -%}
 ...

* NOTE: The new appearance of this bug in <=5.6 is due to a change in Juju behavior where the leader-settings-changed hook can fire immediately after the install hook. This is a race the charm was not previously prepared for.

* NOTE: For percona-cluster 5.7 the wsrep.cnf file will also cause this problem, so the problem occurs even at install time. This is probably why this version of the bug was initially filed. James tells me this has been removed from the package. But if it reappears it will need to be addressed.

* NOTE: The default package configuration (without wsrep.cnf) can start mysql without auto-bootstrapping. Even other wsrep configuration parameters can be configured without issue. It is the empty gcomm address that is the root cause.

This change addresses the problem in two ways. One, the template is guarded so that only the leader can render the auto bootstrapping configuration. Two, code in the config_changed function guarantees non-leaders do not pass values that would render this scenario.
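
A minimal sketch of the guard in plain Python rather than template syntax. The function name and the choice to return None for "do not render an address yet" are assumptions made for illustration; the real change lives in the template and in config_changed.

```python
def wsrep_cluster_address(is_leader, cluster_hosts):
    """Pick the wsrep_cluster_address value for a unit.

    Only the leader may get the empty gcomm:// address that makes
    percona-cluster bootstrap a new cluster. A non-leader with no known
    hosts gets None, meaning no address should be rendered yet.
    """
    hosts = sorted(cluster_hosts)
    if hosts:
        return "gcomm://" + ",".join(hosts)
    if is_leader:
        return "gcomm://"
    return None
```

With this shape a non-leader can never start mysqld with the auto-bootstrapping address, whichever hook fires first.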

 https://review.openstack.org/#/c/538979/

Changed in charm-percona-cluster:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/538979
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=fac45afc6057a2fd18ddb59423965e521e1963e8
Submitter: Zuul
Branch: master

commit fac45afc6057a2fd18ddb59423965e521e1963e8
Author: David Ames <email address hidden>
Date: Mon Jan 29 09:05:03 2018 -0800

    Gate db{-admin} relations until cluster is ready

    Percona-cluster was responding to db and db-admin relations before it
    was ready. This led to the error: "WSREP has not yet prepared node for
    application use."

    This change applies the same gating share-db relation already has to db
    and db-admin relations. It also condenses code used in both instances.

    This change guarantees the rendered configuration will not
    auto-bootstrap for non-leaders. This addresses Bug 1738896.

    Closes-Bug: #1742683
    Closes-Bug: #1738896
    Change-Id: If525595fd109e6a738071a3f016b9c2eabec529e
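
One way to picture the gating described in this commit message is a decorator that skips a relation hook until the cluster reports ready. The decorator, the cluster_state dict and the hook body below are hypothetical; the commit may implement the gate differently.

```python
import functools


def gate_until_cluster_ready(is_ready):
    """Skip the wrapped hook (returning None) until is_ready() is true."""
    def decorator(hook):
        @functools.wraps(hook)
        def wrapper(*args, **kwargs):
            if not is_ready():
                return None  # defer: cluster not yet usable by clients
            return hook(*args, **kwargs)
        return wrapper
    return decorator


# Stand-in for whatever readiness check the charm performs.
cluster_state = {"ready": False}


@gate_until_cluster_ready(lambda: cluster_state["ready"])
def db_relation_changed(database):
    """Pretend to hand out access to a database."""
    return {"database": database, "allowed": True}
```

Sharing one gate across db, db-admin and shared-db keeps the readiness rule in a single place.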

Changed in charm-percona-cluster:
status: In Progress → Fix Committed
James Page (james-page)
Changed in charm-percona-cluster:
milestone: none → 18.02
Ryan Beisner (1chb1n)
Changed in charm-percona-cluster:
status: Fix Committed → Fix Released