Galera agent doesn't work when grastate.dat contains safe_to_bootstrap

Bug #1789527 reported by Aymen Frikha on 2018-08-29
This bug affects 1 person
Affects: resource-agents (Ubuntu)
Assigned to: Andreas Hasenack

Bug Description

The Galera resource agent is unable to start MySQL and promote it to master, even when the safe_to_bootstrap flag in grastate.dat is set to 1.

* res_percona_promote_0 on 09fde2-2 'unknown error' (1): call=1373, status=complete, exitreason='MySQL server failed to start (pid=2432) (rc=0), please check your installation',

The resource agent is not able to handle the safe_to_bootstrap feature in Galera:

I use the Percona cluster database, which uses the same Galera mechanism for clustering.
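
For reference, the flag can be read directly from grastate.dat in the data directory. This is only a minimal sketch, assuming the file stores the flag as a `safe_to_bootstrap: <digit>` line; the helper name `get_safe_to_bootstrap` is hypothetical and is not part of the resource agent:

```shell
# Hypothetical helper: print the safe_to_bootstrap value found in a
# grastate.dat file. Prints nothing if the file has no such line.
get_safe_to_bootstrap() {
    sed -n 's/^safe_to_bootstrap:[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1"
}

# Example call (the data directory depends on the deployment):
# get_safe_to_bootstrap /var/lib/percona-xtradb-cluster/grastate.dat
```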

Packages I use in Xenial:

resource-agents 3.9.7-1
percona-xtradb-cluster-server-5.6 5.6.37-26.21-0ubuntu0.16.04.2
pacemaker 1.1.14-2ubuntu1.4
corosync 2.3.5-3ubuntu2.1

A workaround exists in:
A fix also exists, but it was not applied to the xenial package:

Is it possible to add this fix to the latest resource-agents package in Xenial?

Andreas Hasenack (ahasenack) wrote :

Thanks for bringing this up together with an upstream patch.

Changed in resource-agents (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
Christian Reis (kiko) wrote :
tags: added: server-next
Changed in resource-agents (Ubuntu):
importance: Medium → High
Changed in resource-agents (Ubuntu):
assignee: nobody → Andreas Hasenack (ahasenack)
status: Triaged → In Progress
Andreas Hasenack (ahasenack) wrote :

I have patched packages in this ppa if someone wants to try them:

I couldn't reproduce the bug so far. I deployed the percona-cluster charm on xenial using -n 3, using lxd, and then used lxc stop -f on all units to force-kill them. They came back up just fine. Then I used the charm from, same result. That charm doesn't get me resource-agents installed, which I thought odd, so maybe I'm doing something wrong.

Aymen Frikha (aym-frikha) wrote :

Hi Andreas,

Thank you very much for the package provided. It works fine for me.
You can reproduce the bug by deploying the hacluster charm as a subordinate service for percona cluster. It deploys Pacemaker and sets up the VIP and the resource agent that manages the Percona cluster.
It uses the galera resource agent to control the Percona database because they use the same clustering mechanism.
The hacluster charm uses packages from the xenial main repository. Is it possible to backport the patch to it?


Andreas Hasenack (ahasenack) wrote :

Sure, I just need to get the test case in order. I'll try that today.

Andreas Hasenack (ahasenack) wrote :

This wasn't enough to trigger the bug:

juju bootstrap lxd
juju deploy -n 3 cs:xenial/percona-cluster
juju config percona-cluster vip=<someipIhave> min-cluster-size=3
juju deploy hacluster
juju add-relation percona-cluster hacluster
do a lxc stop -f <all percona units>
(wait for juju status to notice)
do a lxc start <all percona units>
(wait for juju status to become green)

It was my understanding that this should trigger the original bug, but maybe it's racy or needs active database writes to be happening at the moment of the shutdown.

I'll try again now with your charm, the one that uses resource-agents and fails without an update to resource-agents. But if you have another easy way to reproduce the bug, please do tell.

Aymen Frikha (aym-frikha) wrote :

Yes, you are right Andreas, you need active database writes when you shut them down. The resource agent automatically detects which instance has the last commit, starts it as master, and resumes replication to the other instances.
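
To illustrate the idea of picking the instance with the last commit, here is a rough sketch, not the agent's actual code. It assumes each node's last committed sequence number has already been collected as "node seqno" lines; the helper name `pick_bootstrap_node` and the seqno values are made up for illustration:

```shell
# Hypothetical helper: read "node seqno" lines on stdin and print the node
# with the highest seqno, i.e. the one holding the most recent commit.
pick_bootstrap_node() {
    awk 'NR == 1 || $2 + 0 > max { max = $2 + 0; node = $1 } END { print node }'
}

# Illustrative input only; these seqno values are invented.
printf '%s\n' 'controller-1 1042' 'controller-2 1057' 'controller-3 1049' |
    pick_bootstrap_node
# prints: controller-2
```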

Andreas Hasenack (ahasenack) wrote :

By doing graceful shutdowns I can get into a state where the last node to die will have "safe_to_bootstrap:1" in its grastate.dat file. But I couldn't get that node running again, which was odd, as it should be the *only* one that can be started. I had to use one of the other initscript targets, restart-bootstrap, instead of just restart, or else it would time out trying to reach the "juju cluster":

2018-11-09 18:54:58 14147 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1478: Failed to open channel 'juju_cluster' at 'gcomm://,': -110 (Connection timed out)

I see two options here (at least):
a) we backport just what was called the workaround bit, since you say this is what you have been using for a long time now. That is the bit that handles the case where all nodes crashed, and thus "safe_to_bootstrap" is set to zero in all of them. Without the fix, no node will be able to start up in this case. The fix uses the same logic that has always been used to determine the right node to start, from before "safe_to_bootstrap" existed, and once it finds that node, it just flips that flag to 1 to allow the service to be started.
b) we backport the full patch, which consists of part (a) above, plus skipping the logic to find the right node to start if it finds "safe_to_bootstrap" set to 1. This one will need more testing.
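
The flag flip in (a) could look roughly like the sketch below. This is an assumption about the general shape of the change, not the upstream patch itself. The helper name `set_safe_to_bootstrap` is hypothetical, and it assumes the flag is stored as a `safe_to_bootstrap: 0` line in grastate.dat:

```shell
# Hypothetical helper: rewrite "safe_to_bootstrap: 0" to "safe_to_bootstrap: 1"
# in the given grastate.dat. Meant to be run only on the node that was already
# selected as the most advanced one.
set_safe_to_bootstrap() {
    grastate="$1"
    tmpfile="$(mktemp)" || return 1
    # Write the result back with cat so the original file keeps its owner
    # and permissions.
    sed 's/^\(safe_to_bootstrap:\)[[:space:]]*0[[:space:]]*$/\1 1/' "$grastate" > "$tmpfile" &&
        cat "$tmpfile" > "$grastate"
    rc=$?
    rm -f "$tmpfile"
    return "$rc"
}
```

A file that already has the flag set to 1, or has no such line at all, is left unchanged.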

Aymen Frikha (aym-frikha) wrote :

Can you test using this Pacemaker configuration?

primitive p_percona ocf:heartbeat:galera \
        params wsrep_cluster_address="gcomm://controller-1,controller-2,controller-3" \
        params config="/etc/mysql/my.cnf" \
        params datadir="/var/lib/percona-xtradb-cluster" \
        params socket="/var/run/mysqld/mysqld.sock" pid="/var/run/mysqld/" \
        params check_user=root check_passwd=****** \
        params binary="/usr/bin/mysqld_safe" \
        op monitor timeout=120 interval=20 depth=0 \
        op monitor role=Master timeout=120 interval=10 depth=0 \
        op monitor role=Slave timeout=120 interval=30 depth=0
ms ms_percona p_percona \
        meta notify=true interleave=true \
        meta master-max=3 \
        meta ordered=true target-role=Started
