mariadb_recovery fails and data loss

Bug #1627717 reported by bjolo
This bug affects 2 people
Affects   Status         Importance   Assigned to     Milestone
kolla     Fix Released   Critical     Jeffrey Zhang
Liberty   Won't Fix      Critical     Jeffrey Zhang
Mitaka    Won't Fix      Critical     Jeffrey Zhang

Bug Description

Test:
- Situation to mimic is that db nodes/containers have gone down one by one, and finally the last one goes down. During this gradual shutdown, writes to the database take place. The end situation is that we need to do recovery when not all nodes are in sync.

Test setup:
- kolla master
- centos source built 20160926
- multinode

Test execution steps:
- make sure all nodes are in sync (show global status like 'wsrep%')
- shutdown mariadb on 2 nodes. (docker stop mariadb)
- create some users and verify that they exist in db. (openstack user create foo, user list)
- shutdown last mariadb node.
- kolla-ansible mariadb_recovery
- check whether the users still exist
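
For reference, a rough sketch of the commands behind these steps; the container name, credentials, and inventory file name are assumptions based on a default Kolla deployment:

    # check galera sync state on each control node
    docker exec mariadb mysql -u root -p"$DB_ROOT_PASSWORD" \
        -e "SHOW GLOBAL STATUS LIKE 'wsrep%';"

    # stop mariadb on a node
    docker stop mariadb

    # create a test user and confirm it is visible
    openstack user create foo
    openstack user list

    # after the last node is stopped, run the recovery playbook
    kolla-ansible mariadb_recovery -i multinode

    # verify that the test users survived recovery
    openstack user list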

----------------------------------------------------------
inventory file (please note the order of hosts, since that matters.)
[control]
eselde02u32.mydomain.net
eselde02u33.mydomain.net
eselde02u34.mydomain.net

----------------------------------------------------------
Test Case 1.
- shutdown node 33 and 34; create users; shutdown node 32
Result
- all mariadb containers come back online and report they are in sync.
- playbook works. log http://paste.openstack.org/show/582926/
- no data is lost. i.e. the users exist in the database

Test Case 2.
- shutdown node 32 and 34; create users; shutdown node 33
result
- all mariadb containers come back online and report they are in sync.
- Playbook actually fails. log http://paste.openstack.org/show/582940/
- data loss is intermittent: roughly 50% of runs lose the users, 50% do not

Test Case 3.
- shutdown node 32 and 33; create users; shutdown node 34
Result:
- All mariadb containers come back online and report they are in sync.
- Playbook actually fails. log http://paste.openstack.org/show/582929/
- data loss is intermittent: roughly 50% of runs lose the users, 50% do not

----------------------------------------
Conclusion:
I have only read through the code briefly, and there is probably more than one way of doing this, so this is purely speculative on my end. It seems that the code always attempts recovery on the first node in the inventory file, but according to the galera documentation it is imperative that recovery is done on the node that has the highest sequence number. This is why test case 1 works, but cases 2 and 3 fail.

http://galeracluster.com/documentation-webpages/restarting
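
For reference, the check described in the Galera documentation boils down to comparing the saved state on every node; a minimal sketch, assuming grastate.dat is reachable at the default MariaDB datadir path (in Kolla it lives inside the mariadb docker volume):

    # inspect the saved galera state on each db node
    cat /var/lib/mysql/grastate.dat
    # example contents after a graceful shutdown:
    #   # GALERA saved state
    #   version: 2.1
    #   uuid:    6b2ec74b-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    #   seqno:   1352
    #
    # the node with the highest seqno must be bootstrapped first;
    # the remaining nodes then rejoin and sync via IST/SST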

Revision history for this message
Jeffrey Zhang (jeffrey4l) wrote :

thanks bjolo, the recovery node is the root cause. I will try to fix this.

Changed in kolla:
assignee: nobody → Jeffrey Zhang (jeffrey4l)
bjolo (bjorn-lofdahl)
summary: - mariadb_recovery fails and intermittent data loss
+ mariadb_recovery fails and data loss
Steven Dake (sdake)
Changed in kolla:
status: New → Confirmed
importance: Undecided → Critical
milestone: none → newton-rc2
Revision history for this message
Sam Yaple (s8m) wrote :

#kolla.2015-10-17.log-00:36 < kfox1111> yeah. specifically, what the procedure is for galera to recover it from power failure.
#kolla.2015-10-17.log-00:36 < SamYaple> that is a bit tricky because thats dependant on galera
#kolla.2015-10-17.log-00:36 < kfox1111> the docs for it mention doing some games finding the last written server and bringing that one up first with special args, then adding the rest.
#kolla.2015-10-17.log-00:36 < kfox1111> but I'm not sure how that will work with the containers.
#kolla.2015-10-17.log:00:36 < SamYaple> so the official way if you are unaware is to find /var/lib/mysql/grastate.dat with the highest revision
#kolla.2015-10-17.log-00:37 < SamYaple> but when it crashes that is sometimes -1
#kolla.2015-10-17.log-00:37 < SamYaple> but basically you have to pick a node to start the cluster again with
#kolla.2015-10-17.log-00:37 < SamYaple> ideally the llast node to shutdown
#kolla.2015-10-17.log-00:37 < SamYaple> this cant be done automatically, so it will be in the deployers responsibilities to do this
#kolla.2015-10-17.log-00:37 < kfox1111> I'm guessing with powerfailure, they will be basically the same.
#kolla.2015-10-17.log-00:38 < SamYaple> maybe maybe not, what if one was down ahead of time anyway
#kolla.2015-10-17.log-00:38 < kfox1111> but how do you start it back up with the containers? do you tweak a config file and docker start it back up, or do you use ansible?
#kolla.2015-10-17.log-00:38 < kfox1111> ah. true.

Revision history for this message
Sam Yaple (s8m) wrote :

The issue here is that galera recovery can't be 100% automated. It requires some knowledge of galera and the state of the cluster.

With a power failure, if this occurs:
  * all nodes are at -1
  * all nodes were running prior to the power failure
  * you are not a database expert who can stitch together a database from backups

The answer is to pick one at random (preferably the last one to receive writes which is, by default, the first galera node, mariadb[0]). That is what the current playbooks do.

Tweaking them to register the highest value from grastate.dat is a mistake, since you could stop a node safely and then have a power outage later: that _old_ node would win with the highest recorded revision number even though it is behind.

Revision history for this message
bjolo (bjorn-lofdahl) wrote :

One issue is that the playbook actually fails in case 2 and 3 as it is right now.

In regards to Sam's info, we have a few situations.

case 1:
- all nodes have different high sequence numbers. (no -1)
- playbook should do recovery on the node with highest number.

case 2:
- one node with high seq no. one or more with -1.
- all nodes with -1 have unknown seqno. they could be more advanced than the node with high seqno.
- state unknown. no way for kolla to know which one is the node with the latest data
solution:
- Prompt the operator that the highest known seqno is on node-1, but node-2 (-1) could be higher; ask them to specify the recovery node
- kolla-ansible mariadb_recovery master=node-x

case 3:
- all nodes have -1.
- recovery procedure just like case 2
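
For the -1 cases above, one way the operator could determine a node's real position before answering such a prompt is a recovery-only pass of mysqld. This is only a sketch; the exact invocation inside Kolla's mariadb container may differ:

    # print the last committed position without starting the server normally
    mysqld --wsrep-recover 2>&1 | grep -i 'recovered position'
    # typical output:
    #   WSREP: Recovered position: 6b2ec74b-xxxx-xxxx-xxxx-xxxxxxxxxxxx:1352
    # the number after the final colon is the seqno to compare across nodes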

Revision history for this message
Jeffrey Zhang (jeffrey4l) wrote :

Here is an article that explains the cases in detail. [0]

There are 7 cases:

1. Node A is gracefully stopped.
2. Nodes A and B are gracefully stopped
3. All three nodes are gracefully stopped.
4. Node A disappears from the cluster.
5. Nodes A and B disappear.
6. All nodes went down without proper shutdown procedure.
7. Cluster lost it’s primary state due to split brain situation.

mariadb+galera handles cases 1 and 2 automatically.
`mariadb_recovery` handles case 3 (this bug); we should choose a better bootstrap node based on the grastate.dat file.
Cases 4 and 5 need a re-deploy.
Case 7 is the most complicated, and we cannot handle it now.

[0] https://www.percona.com/blog/2014/09/01/galera-replication-how-to-recover-a-pxc-cluster/

Revision history for this message
Jeffrey Zhang (jeffrey4l) wrote :

Case 3 and case 6 look the same.

But case 6 will be recovered automatically by `pc.recovery` (which is the default), and I do not think we need to add the `--new-cluster` parameter.
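
For reference, pc.recovery is a wsrep provider option that is enabled by default in current Galera releases; a purely illustrative configuration snippet:

    # in the galera/mysqld configuration
    wsrep_provider_options="pc.recovery=TRUE"
    # with pc.recovery on, nodes persist the primary component state in
    # gvwstate.dat and re-form the cluster automatically after a
    # simultaneous crash, without --wsrep-new-cluster on any node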

Revision history for this message
Steven Dake (sdake) wrote :

No action on this bug for about a week. Jeffrey, any updates? It sounds like a pretty complex problem to solve. I'm not sure how a database expert would even recover from this problem.

Note it isn't just Kolla that has this problem but every ODM.

Revision history for this message
Jeffrey Zhang (jeffrey4l) wrote :

Will push fix today.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (master)

Fix proposed to branch: master
Review: https://review.openstack.org/384170

Changed in kolla:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (master)

Reviewed: https://review.openstack.org/384170
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=1bcb139392a5989ef0cc733e1c39d7b6cc87ccbe
Submitter: Jenkins
Branch: master

commit 1bcb139392a5989ef0cc733e1c39d7b6cc87ccbe
Author: Jeffrey Zhang <email address hidden>
Date: Sun Oct 9 10:50:06 2016 +0800

    Choose node with largest seqno number for mariadb recovery

    When all mariadb nodes are stopped gracefully, mariadb galera will
    write it's last executed position into the grastate.dat file. Need find
    the node with largest seqno number in that file and recovery from that
    node.

    Closes-Bug: #1627717
    Change-Id: I6e97c190eec99c966bffde0698f783e519ba14bd

Changed in kolla:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/384247

Revision history for this message
Mark Casey (mark-casey) wrote :

I raised an issue on one of the reviews for the current fix that makes me want to defer this until after the pending release.

And then, another opinion on how to fix this:
Since combinations of scenarios get complicated and initial operator troubleshooting may invalidate some of our assumptions before recovery, I'd advocate for having only 3 classes of db recovery (and not applying any galera auto-recovery that is custom to Kolla). I believe these work no matter the complexity of the scenario:

Scenario 1: Some set of nodes is still wsrep_cluster_status==Primary
Solution: Start/reboot failed nodes and let them IST/SST

Scenario 2: No set of nodes is still wsrep_cluster_status==Primary but pc.recovery should work
Solution: Start nodes and let pc.recovery work, then Operator applies Scenario 1 for any remaining failed nodes

^^If the nodes (db containers) coming online at roughly the same time is not enough for pc.recovery to operate, then this would require 'kolla-ansible reconfigure' (or perhaps 'kolla-ansible mariadb_recovery'?) to be able to oversee a cluster recovering via pc.recovery.

Scenario 3: No set of nodes is still wsrep_cluster_status==Primary and pc.recovery cannot work:
Solution: The operator must pass the desired sequence no. *AND* a list of nodes that are acceptable to bootstrap from. If and only if one (or several) of these acceptable nodes is at the targeted sequence, one node that meets the criteria is used to bootstrap.

kolla-ansible mariadb_recovery -e target_seqno=xxxxxx,allowed-for-bootstrap='node1,node3'

'mysqld --wsrep-recover' can find the seqno on a node that is -1 in grastate, but running it may implicitly perform some innodb recovery that the operator would need to be warned about, so I'll look and see how to find it another way.
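
To tell these scenarios apart in practice, an operator would normally check the cluster status on each node that is still running; a minimal sketch, with the container name and credentials assumed:

    # a node still belonging to a Primary component reports "Primary"
    docker exec mariadb mysql -u root -p"$DB_ROOT_PASSWORD" \
        -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"
    # "non-Primary" (or an unreachable server) indicates scenario 2 or 3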

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla 3.0.0.0rc2

This issue was fixed in the openstack/kolla 3.0.0.0rc2 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla (stable/mitaka)

Change abandoned by Michal Jastrzebski (inc0) (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/384247
Reason: mitaka is EOL
