HA scenario is broken: galera bootstrap fails

Bug #1622613 reported by Emilien Macchi
This bug affects 2 people
Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   Critical     Unassigned
Liberty   Fix Released   Critical     Unassigned
Mitaka    Fix Released   Critical     Unassigned

Bug Description

This new error happens when deploying a controller with Pacemaker and MySQL Galera:

Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql/Exec[galera-set-root-password]/returns: change from notrun to 0 failed

After some investigation, it looks like it's related to the latest update of the Galera package.

Tags: ci
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
importance: Undecided → Critical
summary: - galera-25.3.12-2 breaks HA scenario
+ HA scenario is broken: galera bootstrap fails
Emilien Macchi (emilienm) wrote :

Here's a package difference between a very recent working job and a failing job:
https://www.diffchecker.com/t6HnUxle

tags: added: alert ci
Changed in tripleo:
milestone: none → newton-rc1
Damien Ciabrini (dciabrin) wrote :

Looking at /var/log/mysql.log, on this particular run, controller0 was selected as galera's bootstrap node and started successfully.

The other two nodes controller1 and controller2 acted as "joiners" and none of them could connect to the bootstrap node to join the galera cluster.

Investigating why this could be the case.

Ryan Hallisey (rthall14) wrote :

I'm also seeing this.

Emilien Macchi (emilienm) wrote :

TripleO CI will be fixed by removing EPEL, so that we download the version we know is working from the RDO repository: https://review.openstack.org/#/c/347499/

I'm closing the bug, but feel free to open a new one if you think something is wrong in our Galera deployment manifests.

tags: removed: alert
Changed in tripleo:
status: New → Fix Released
Damien Ciabrini (dciabrin) wrote :

Update as to why the bootstrap of Galera cluster was failing:

We found out with Mike Bayer that version 25.3.12 of the galera WSREP provider behaves differently than version 25.3.5 when processing the provider option:

   wsrep_provider_options = gmcast.listen_addr=tcp://[172.16.2.15]:4567

This value is set by t-h-t and has square brackets around the address so that galera can listen on either an IPv4 or an IPv6 address at startup.

In the newer version of the galera provider, joining nodes won't connect to the cluster if the listen address is an IPv4 address and contains brackets.

Removing the brackets makes the bootstrap work again for IPv4 use cases.
IPv6 still needs to be tested to confirm that brackets keep working as expected there.
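
To illustrate the idea of dropping the brackets only in the IPv4 case, here is a minimal Python sketch. It is not the fix that went into t-h-t; the helper name gmcast_listen_addr is hypothetical, the default port 4567 is taken from the option shown above, and the fd00::15 address is made up for the example. It also assumes that bracketed IPv6 literals are still accepted by the newer provider, which this report has not verified.

```python
import ipaddress


def gmcast_listen_addr(host: str, port: int = 4567) -> str:
    """Build a gmcast.listen_addr value, bracketing only IPv6 literals.

    Hypothetical helper for illustration; not part of t-h-t or galera.
    """
    # Accept the address with or without surrounding brackets.
    bare = host.strip("[]")
    try:
        ip = ipaddress.ip_address(bare)
    except ValueError:
        # Not an IP literal (for example a hostname): no brackets.
        return f"tcp://{bare}:{port}"
    if ip.version == 6:
        # Assumption: IPv6 literals still need brackets in the URL.
        return f"tcp://[{bare}]:{port}"
    # IPv4: brackets removed, which is what made bootstrap work again.
    return f"tcp://{bare}:{port}"


if __name__ == "__main__":
    for addr in ("[172.16.2.15]", "fd00::15"):
        print("wsrep_provider_options = gmcast.listen_addr=" + gmcast_listen_addr(addr))
```

For the IPv4 address from this report the helper yields tcp://172.16.2.15:4567, while an IPv6 literal keeps its brackets.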

Ben Nemec (bnemec) wrote :

We're going to merge https://review.openstack.org/#/c/347499/ to unblock master CI, but we need to discuss a solution for the stable branches. There are deployments out there (including our CI cloud) that already have EPEL enabled and will break if they update to this newer package. If there's a reasonably backportable way to remove the brackets in the IPv4 case, that would be excellent.

Emilien Macchi (emilienm) wrote :

See https://bugs.launchpad.net/tripleo/+bug/1622755 to follow up on the actual galera issue.

Ben Nemec (bnemec) wrote :

I'm going to mark this as fix released per https://bugs.launchpad.net/tripleo/+bug/1622755 since that should address the problem.
