MariaDB lights out recovery

Bug #1558399 reported by OpenStack Infra
Affects: kolla
Status: Fix Released
Importance: Wishlist
Assigned to: Hui Kang

Bug Description

https://review.openstack.org/293161
Dear bug triager. This bug was created since a commit was marked with DOCIMPACT.
Your project "openstack/kolla" is set up so that we directly report the documentation bugs against it. If this needs changing, the docimpact-group option needs to be added for the project. You can ask the OpenStack infra team (#openstack-infra on freenode) for help if you need to.

commit 2aaaed770e158c3dc0d6e1895021c0e0475d200d
Author: SamYaple <email address hidden>
Date: Mon Feb 29 15:02:15 2016 +0000

    MariaDB lights out recovery

    This playbook only matters for multinode since AIO can recover from
    power outage without additional configuration.

    DocImpact
    Implements: blueprint mariadb-lights-out
    Change-Id: I903c3bcd069af39814bcabcef37684b1f043391f
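
A minimal sketch of how a deployer might run the new playbook against a multinode inventory (the inventory and /etc/kolla file paths are illustrative assumptions; the exact entry point differs between kolla releases):

    # From the kolla source tree on the deployment host (paths are examples):
    ansible-playbook -i ansible/inventory/multinode \
        -e @/etc/kolla/globals.yml \
        -e @/etc/kolla/passwords.yml \
        ansible/mariadb_recovery.yml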

Tags: doc
Steven Dake (sdake)
Changed in kolla:
importance: Undecided → Wishlist
milestone: none → mitaka-rc2
status: New → Confirmed
Hui Kang (huikang27)
Changed in kolla:
assignee: nobody → Hui Kang (huikang27)
Revision history for this message
Sam Yaple (s8m) wrote :

Hui, you are welcome to take this on. Ask me if you have any questions, but this does need to merge before rc2.

Revision history for this message
Hui Kang (huikang27) wrote :

Hi, Sam.
Some questions about trying out the mariadb recovery playbook. I deployed a multi-node kolla with one control node, one network node, and one compute node. Then I used "docker rm -f mariadb" to remove the mariadb container and ran "mariadb_recovery.yml". The playbook failed at

"TASK: [mariadb | fail ] *******************************************************
skipping: [9.2.212.54]

TASK: [mariadb | Checking if and mariadb containers are running] **************
failed: [172.16.1.101] => {"failed": true}
msg: No such container: mariadb

FATAL: all hosts have already failed -- aborting
"
Is there anything wrong with my setup? Thanks.

- Hui

Revision history for this message
Sam Yaple (s8m) wrote :

The lights out recovery is for power outages, not container removals and recreations.

The container itself should still be around for the script to run.
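
The distinction matters for the check that failed above. A rough way to see it on a mariadb host (a sketch; "mariadb" is the kolla default container name):

    # A stopped container (e.g. after a power loss) is still listed:
    docker ps -a --filter name=mariadb
    # ...whereas after `docker rm -f mariadb` the same command returns nothing,
    # so the recovery playbook has no container left to restart.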

Revision history for this message
Hui Kang (huikang27) wrote :

Hi, Sam. So I can simulate the power outage with "shutdown now", and then run the recovery script to recover the database. Is my understanding correct? Thanks. - Hui
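
A rough way to exercise that scenario on a test deployment (a sketch, assuming every host running mariadb is taken down, not just one):

    # On each controller/mariadb host, simulate the outage:
    sudo shutdown now

    # After the hosts are powered back on, the stopped mariadb containers
    # and their data volumes are still present; run the recovery playbook
    # from the deployment host as sketched in the bug description above.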

Revision history for this message
Steven Dake (sdake) wrote :

Please fix now, but moving to newton since docs are not versioned; as such we won't backport the documentation.

Changed in kolla:
milestone: mitaka-rc2 → newton-1
Steven Dake (sdake)
tags: added: docimpact
removed: doc kolla
Steven Dake (sdake)
Changed in kolla:
milestone: newton-1 → newton-2
Changed in kolla:
milestone: newton-2 → newton-3
tags: added: doc
removed: docimpact
Changed in kolla:
milestone: newton-3 → newton-rc1
Changed in kolla:
milestone: newton-rc1 → ocata-1
Revision history for this message
Sam Yaple (s8m) wrote :

#kolla.2015-10-17.log-00:36 < kfox1111> yeah. specifically, what the procedure is for galera to recover it from power failure.
#kolla.2015-10-17.log-00:36 < SamYaple> that is a bit tricky because thats dependant on galera
#kolla.2015-10-17.log-00:36 < kfox1111> the docs for it mention doing some games finding the last written server and bringing that one up first with special args, then adding the rest.
#kolla.2015-10-17.log-00:36 < kfox1111> but I'm not sure how that will work with the containers.
#kolla.2015-10-17.log-00:36 < SamYaple> so the official way if you are unaware is to find /var/lib/mysql/grastate.dat with the highest revision
#kolla.2015-10-17.log-00:37 < SamYaple> but when it crashes that is sometimes -1
#kolla.2015-10-17.log-00:37 < SamYaple> but basically you have to pick a node to start the cluster again with
#kolla.2015-10-17.log-00:37 < SamYaple> ideally the llast node to shutdown
#kolla.2015-10-17.log-00:37 < SamYaple> this cant be done automatically, so it will be in the deployers responsibilities to do this
#kolla.2015-10-17.log-00:37 < kfox1111> I'm guessing with powerfailure, they will be basically the same.
#kolla.2015-10-17.log-00:38 < SamYaple> maybe maybe not, what if one was down ahead of time anyway
#kolla.2015-10-17.log-00:38 < kfox1111> but how do you start it back up with the containers? do you tweak a config file and docker start it back up, or do you use ansible?
#kolla.2015-10-17.log-00:38 < kfox1111> ah. true.
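
A small sketch of the check described in the log: read grastate.dat on each mariadb host and compare the seqno values. (The host-side path below assumes kolla's named "mariadb" Docker volume; inside the container the file is /var/lib/mysql/grastate.dat.)

    # On each mariadb host:
    sudo cat /var/lib/docker/volumes/mariadb/_data/grastate.dat

    # Typical contents; seqno is the revision discussed above, and a node
    # that crashed hard often reports seqno: -1.
    #   # GALERA saved state
    #   version: 2.1
    #   uuid:    <cluster uuid>
    #   seqno:   1234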

Revision history for this message
Sam Yaple (s8m) wrote :

The issue here is that galera recovery can't be 100% automated. It requires some knowledge of galera and the state of the cluster.

With a power failure, if all of the following hold:
  * all nodes report -1 in grastate.dat
  * all nodes were running prior to the power failure
  * you are not a database expert who can stitch together a database from backups

then the answer is to pick one node more or less at random (preferably the last one to receive writes, which by default is the first galera node, mariadb[0]). That is what the current playbooks do.

Tweaking them to pick the node with the highest value from grastate.dat would be a mistake: you could stop a node safely and then have a power outage later, and that _old_ node would win with the highest revision number.
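
For reference, the manual Galera procedure the thread alludes to looks roughly like this when run against plain MariaDB servers (a sketch only; the kolla playbook performs the equivalent steps on the containers):

    # 1. Pick the bootstrap node (ideally the last one to receive writes)
    #    and start it as a new cluster:
    mysqld_safe --wsrep-new-cluster &

    # 2. Start MariaDB normally on the remaining nodes; they rejoin the
    #    cluster and sync their state (IST/SST) from the bootstrap node.
    mysqld_safe &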

Changed in kolla:
milestone: ocata-1 → ocata-2
Changed in kolla:
milestone: ocata-2 → ocata-3
Mark Goddard (mgoddard)
Changed in kolla:
status: Confirmed → Fix Released