MariaDB lights out recovery

Bug #1558399 reported by OpenStack Infra
Affects: kolla
Status: Fix Released
Importance: Wishlist
Assigned to: Hui Kang

Bug Description

https://review.openstack.org/293161
Dear bug triager. This bug was created since a commit was marked with DOCIMPACT.
Your project "openstack/kolla" is set up so that we directly report the documentation bugs against it. If this needs changing, the docimpact-group option needs to be added for the project. You can ask the OpenStack infra team (#openstack-infra on freenode) for help if you need to.

commit 2aaaed770e158c3dc0d6e1895021c0e0475d200d
Author: SamYaple <email address hidden>
Date: Mon Feb 29 15:02:15 2016 +0000

    MariaDB lights out recovery

    This playbook only matters for multinode since AIO can recover from
    power outage without additional configuration.

    DocImpact
    Implements: blueprint mariadb-lights-out
    Change-Id: I903c3bcd069af39814bcabcef37684b1f043391f
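
A minimal sketch of how a deployer might run the new playbook against a multinode inventory (the inventory and /etc/kolla file paths are illustrative assumptions; the exact entry point differs between kolla releases):

    # From the kolla source tree on the deployment host (paths are examples):
    ansible-playbook -i ansible/inventory/multinode \
        -e @/etc/kolla/globals.yml \
        -e @/etc/kolla/passwords.yml \
        ansible/mariadb_recovery.yml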

Tags: doc
Steven Dake (sdake)
Changed in kolla:
importance: Undecided → Wishlist
milestone: none → mitaka-rc2
status: New → Confirmed
Hui Kang (huikang27)
Changed in kolla:
assignee: nobody → Hui Kang (huikang27)
Revision history for this message
Sam Yaple (s8m) wrote :

Hui, you are welcome to take this on. Ask me if you have any questions, but this does need to merge before rc2.

Revision history for this message
Hui Kang (huikang27) wrote :

Hi, Sam.
Some questions about trying out the mariadb recovery playbook. I deployed a multi-node kolla with one control node, one network node, and one compute node. Then I used "docker rm -f mariadb" to remove the mariadb container and ran "mariadb_recovery.yml". The playbook failed at

"TASK: [mariadb | fail ] *******************************************************
skipping: [9.2.212.54]

TASK: [mariadb | Checking if and mariadb containers are running] **************
failed: [172.16.1.101] => {"failed": true}
msg: No such container: mariadb

FATAL: all hosts have already failed -- aborting
"
Is there anything wrong with my setup? Thanks.

- Hui

Revision history for this message
Sam Yaple (s8m) wrote :

The lights out recovery is for power outages, not container removals and recreations.

The container itself should still be around for the script to run.
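
The distinction matters for the check that failed above. A rough way to see it on a mariadb host (a sketch; "mariadb" is the kolla default container name):

    # A stopped container (e.g. after a power loss) is still listed:
    docker ps -a --filter name=mariadb
    # ...whereas after `docker rm -f mariadb` the same command returns nothing,
    # so the recovery playbook has no container left to restart.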

Revision history for this message
Hui Kang (huikang27) wrote :

Hi, Sam. So I can simulate the power outage with "shutdown now", and then run the recovery script to recover the database. Is my understanding correct? Thanks. - Hui
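
A rough way to exercise that scenario on a test deployment (a sketch, assuming every host running mariadb is taken down, not just one):

    # On each controller/mariadb host, simulate the outage:
    sudo shutdown now

    # After the hosts are powered back on, the stopped mariadb containers
    # and their data volumes are still present; run the recovery playbook
    # from the deployment host as sketched in the bug description above.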

Revision history for this message
Steven Dake (sdake) wrote :

Please fix now, but moving to newton since docs are not versioned; as such we won't backport the documentation.

Changed in kolla:
milestone: mitaka-rc2 → newton-1
Steven Dake (sdake)
tags: added: docimpact
removed: doc kolla
Steven Dake (sdake)
Changed in kolla:
milestone: newton-1 → newton-2
Changed in kolla:
milestone: newton-2 → newton-3
tags: added: doc
removed: docimpact
Changed in kolla:
milestone: newton-3 → newton-rc1
Changed in kolla:
milestone: newton-rc1 → ocata-1
Revision history for this message
Sam Yaple (s8m) wrote :

#kolla.2015-10-17.log-00:36 < kfox1111> yeah. specifically, what the procedure is for galera to recover it from power failure.
#kolla.2015-10-17.log-00:36 < SamYaple> that is a bit tricky because thats dependant on galera
#kolla.2015-10-17.log-00:36 < kfox1111> the docs for it mention doing some games finding the last written server and bringing that one up first with special args, then adding the rest.
#kolla.2015-10-17.log-00:36 < kfox1111> but I'm not sure how that will work with the containers.
#kolla.2015-10-17.log-00:36 < SamYaple> so the official way if you are unaware is to find /var/lib/mysql/grastate.dat with the highest revision
#kolla.2015-10-17.log-00:37 < SamYaple> but when it crashes that is sometimes -1
#kolla.2015-10-17.log-00:37 < SamYaple> but basically you have to pick a node to start the cluster again with
#kolla.2015-10-17.log-00:37 < SamYaple> ideally the llast node to shutdown
#kolla.2015-10-17.log-00:37 < SamYaple> this cant be done automatically, so it will be in the deployers responsibilities to do this
#kolla.2015-10-17.log-00:37 < kfox1111> I'm guessing with powerfailure, they will be basically the same.
#kolla.2015-10-17.log-00:38 < SamYaple> maybe maybe not, what if one was down ahead of time anyway
#kolla.2015-10-17.log-00:38 < kfox1111> but how do you start it back up with the containers? do you tweak a config file and docker start it back up, or do you use ansible?
#kolla.2015-10-17.log-00:38 < kfox1111> ah. true.
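
A small sketch of the check described in the log: read grastate.dat on each mariadb host and compare the seqno values. (The host-side path below assumes kolla's named "mariadb" Docker volume; inside the container the file is /var/lib/mysql/grastate.dat.)

    # On each mariadb host:
    sudo cat /var/lib/docker/volumes/mariadb/_data/grastate.dat

    # Typical contents; seqno is the revision discussed above, and a node
    # that crashed hard often reports seqno: -1.
    #   # GALERA saved state
    #   version: 2.1
    #   uuid:    <cluster uuid>
    #   seqno:   1234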

Revision history for this message
Sam Yaple (s8m) wrote :

The issue here is that galera recovery can't be 100% automated. It requires some knowledge of galera and the state of the cluster.

With a power failure, if all of the following hold:
  * all nodes report -1 in grastate.dat
  * all nodes were running prior to the power failure
  * you are not a database expert who can stitch together a database from backups

then the answer is to pick one node more or less at random (preferably the last one to receive writes, which by default is the first galera node, mariadb[0]). That is what the current playbooks do.

Tweaking them to pick the node with the highest value from grastate.dat would be a mistake: you could stop a node safely and then have a power outage later, and that _old_ node would win with the highest revision number.
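
For reference, the manual Galera procedure the thread alludes to looks roughly like this when run against plain MariaDB servers (a sketch only; the kolla playbook performs the equivalent steps on the containers):

    # 1. Pick the bootstrap node (ideally the last one to receive writes)
    #    and start it as a new cluster:
    mysqld_safe --wsrep-new-cluster &

    # 2. Start MariaDB normally on the remaining nodes; they rejoin the
    #    cluster and sync their state (IST/SST) from the bootstrap node.
    mysqld_safe &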

Changed in kolla:
milestone: ocata-1 → ocata-2
Changed in kolla:
milestone: ocata-2 → ocata-3
Mark Goddard (mgoddard)
Changed in kolla:
status: Confirmed → Fix Released