MariaDB lights out recovery

Bug #1558399 reported by OpenStack Infra on 2016-03-17
This bug affects 2 people
Assigned to: Hui Kang

Bug Description
Dear bug triager. This bug was created since a commit was marked with DOCIMPACT.
Your project "openstack/kolla" is set up so that we directly report the documentation bugs against it. If this needs changing, the docimpact-group option needs to be added for the project. You can ask the OpenStack infra team (#openstack-infra on freenode) for help if you need to.

commit 2aaaed770e158c3dc0d6e1895021c0e0475d200d
Author: SamYaple <email address hidden>
Date: Mon Feb 29 15:02:15 2016 +0000

    MariaDB lights out recovery

    This playbook only matters for multinode since AIO can recover from
    power outage without additional configuration.

    Implements: blueprint mariadb-lights-out
    Change-Id: I903c3bcd069af39814bcabcef37684b1f043391f

Tags: doc
Steven Dake (sdake) on 2016-03-17
Changed in kolla:
importance: Undecided → Wishlist
milestone: none → mitaka-rc2
status: New → Confirmed
Hui Kang (huikang27) on 2016-03-17
Changed in kolla:
assignee: nobody → Hui Kang (huikang27)
Sam Yaple (s8m) wrote :

Hui, you are welcome to take this on. Ask me if you have any questions, but this does need to merge before rc2.

Hui Kang (huikang27) wrote :

Hi, Sam.
I have a question about trying out the mariadb recovery playbook. I deployed a multi-node kolla with one control node, one network node, and one compute node. Then I used "docker rm -f mariadb" to remove the mariadb container and ran "mariadb_recovery.yml". The playbook failed at

"TASK: [mariadb | fail ] *******************************************************
skipping: []

TASK: [mariadb | Checking if and mariadb containers are running] **************
failed: [] => {"failed": true}
msg: No such container: mariadb

FATAL: all hosts have already failed -- aborting"

Is there anything wrong with my setup? Thanks.

- Hui

Sam Yaple (s8m) wrote :

The lights out recovery is for power outages, not container removals and recreations.

The container itself should still be around for the script to run.
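
As a rough illustration of that precondition, the sketch below checks whether a stopped or running container with a given name still exists before a recovery attempt. It assumes the standard `docker ps -a --format '{{.Names}}'` listing (one container name per line); the helper names `container_exists` and `list_container_names` are hypothetical and are not part of kolla or its playbooks.

```python
import subprocess


def list_container_names() -> str:
    """Return `docker ps -a` output with one container name per line.

    Assumes the docker CLI is installed and usable by the current user.
    """
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def container_exists(ps_output: str, name: str) -> bool:
    """True if `name` appears as an exact line in the listing."""
    names = {line.strip() for line in ps_output.splitlines() if line.strip()}
    return name in names


# Example listings (made-up data, not taken from a real deployment).
after_power_loss = "mariadb\nrabbitmq\nkeystone\n"
after_docker_rm = "rabbitmq\nkeystone\n"
```

With the made-up listings above, `container_exists(after_power_loss, "mariadb")` is true, while the `docker rm -f` case returns false, which matches the "No such container: mariadb" failure reported earlier in this thread.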

Hui Kang (huikang27) wrote :

Hi, Sam. So I can simulate the power outage with "shutdown now" and then run the recovery script to recover the database. Is my understanding correct? Thanks. - Hui

Steven Dake (sdake) wrote :

Please fix now, but I am moving this to newton since docs are not versioned. As such, we won't backport the documentation.

Changed in kolla:
milestone: mitaka-rc2 → newton-1
Steven Dake (sdake) on 2016-03-22
tags: added: docimpact
removed: doc kolla
Steven Dake (sdake) on 2016-06-23
Changed in kolla:
milestone: newton-1 → newton-2
Changed in kolla:
milestone: newton-2 → newton-3
tags: added: doc
removed: docimpact
Changed in kolla:
milestone: newton-3 → newton-rc1
Changed in kolla:
milestone: newton-rc1 → ocata-1
Sam Yaple (s8m) wrote :

#kolla.2015-10-17.log-00:36 < kfox1111> yeah. specifically, what the procedure is for galera to recover it from power failure.
#kolla.2015-10-17.log-00:36 < SamYaple> that is a bit tricky because thats dependant on galera
#kolla.2015-10-17.log-00:36 < kfox1111> the docs for it mention doing some games finding the last written server and bringing that one up first with special args, then adding the rest.
#kolla.2015-10-17.log-00:36 < kfox1111> but I'm not sure how that will work with the containers.
#kolla.2015-10-17.log:00:36 < SamYaple> so the official way if you are unaware is to find /var/lib/mysql/grastate.dat with the highest revision
#kolla.2015-10-17.log-00:37 < SamYaple> but when it crashes that is sometimes -1
#kolla.2015-10-17.log-00:37 < SamYaple> but basically you have to pick a node to start the cluster again with
#kolla.2015-10-17.log-00:37 < SamYaple> ideally the last node to shutdown
#kolla.2015-10-17.log-00:37 < SamYaple> this cant be done automatically, so it will be in the deployers responsibilities to do this
#kolla.2015-10-17.log-00:37 < kfox1111> I'm guessing with powerfailure, they will be basically the same.
#kolla.2015-10-17.log-00:38 < SamYaple> maybe maybe not, what if one was down ahead of time anyway
#kolla.2015-10-17.log-00:38 < kfox1111> but how do you start it back up with the containers? do you tweak a config file and docker start it back up, or do you use ansible?
#kolla.2015-10-17.log-00:38 < kfox1111> ah. true.
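
For readers unfamiliar with the file mentioned in the log, here is a minimal sketch of reading the saved revision (the `seqno` field) out of a grastate.dat file. The sample contents are an assumption about the usual `key: value` layout of that file and use a placeholder uuid; the function names `parse_grastate` and `saved_seqno` are hypothetical helpers, not something the kolla playbooks provide.

```python
def parse_grastate(text: str) -> dict:
    """Parse `key: value` lines, skipping blanks and `#` comments."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def saved_seqno(text: str) -> int:
    """Return the recorded revision; -1 means no usable value was saved."""
    return int(parse_grastate(text)["seqno"])


# Assumed sample of a node that went down uncleanly (placeholder uuid).
SAMPLE = """# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
"""
```

Calling `saved_seqno(SAMPLE)` on this sample yields -1, which is the "sometimes -1" case described in the log: the file alone does not say how far that node had progressed.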

Sam Yaple (s8m) wrote :

The issue here is that galera recovery can't be 100% automated. It requires some knowledge of galera and the state of the cluster.

If a power failure leaves you in this situation:
  * all nodes report -1 in grastate.dat
  * all nodes were running prior to the power failure
  * you are not a database expert who can stitch together a database from backups

The answer is to pick one at random (preferably the last one to receive writes which is, by default, the first galera node, mariadb[0]). That is what the current playbooks do.

Tweaking them to register the highest value from grastate.dat is a mistake, since you could stop a node safely and then have a power outage at some later point. The crashed nodes may all report -1, so that _old_ node would win with the highest revision number even though it holds the stalest data.
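
The pitfall can be shown with a small sketch that compares the two selection strategies on one made-up scenario. The node names, the revision numbers, and both function names are hypothetical and chosen only for illustration; this is not the logic of the actual playbook, just the reasoning above expressed as code.

```python
def pick_highest_seqno(seqnos: dict) -> str:
    """Naive strategy: bootstrap from the node with the highest saved revision."""
    return max(seqnos, key=seqnos.get)


def pick_first_node(nodes: list) -> str:
    """Strategy described above: bootstrap from the first galera node."""
    return nodes[0]


nodes = ["mariadb0", "mariadb1", "mariadb2"]

# Made-up scenario: mariadb2 was stopped cleanly some time ago and kept its
# saved revision; the other two kept taking writes until the power failed,
# so their files only say -1.
seqnos = {"mariadb0": -1, "mariadb1": -1, "mariadb2": 1500}

naive_choice = pick_highest_seqno(seqnos)
playbook_style_choice = pick_first_node(nodes)
```

In this scenario the naive strategy selects `mariadb2`, the node with the oldest data, while picking the first node selects `mariadb0`, one of the nodes that was still receiving writes when the outage happened.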

Changed in kolla:
milestone: ocata-1 → ocata-2
Changed in kolla:
milestone: ocata-2 → ocata-3