A Galera prim node can't be started by the OCF RA because another node is running and thinks the prim is OK

Bug #1595911 reported by Bogdan Dobrelya
Affects: Fuel for OpenStack
Status: Confirmed
Importance: Medium
Assigned to: Fuel Sustaining
Milestone: 10.0

Bug Description

If no prim Galera node is ready, any other resource instances that have managed to start by chance must fail while waiting for a new prim; otherwise, by design of the Galera OCF RA, the prim fails to start. The re-election checker reports false when it finds another resource instance running in the same Pacemaker partition that has quorum.

In this bug, a node waiting for a prim *was* kept started without a prim node ready, and the OCF RA mistakenly kept reporting that the prim is OK.
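The failing check can be illustrated with a simplified, hypothetical sketch (the function name and its input are illustrative, not the actual resource-agent code): before bootstrapping with --wsrep-new-cluster, the RA looks for any other instance already running in the quorate partition and refuses to bootstrap if one is found.

```shell
# Hypothetical sketch of the prim re-election guard, NOT the real
# Galera OCF RA code. It models the decision this bug is about:
# "is a re-election (bootstrap) needed, or does a prim already exist?"
reelection_needed() {
    local running_peers="$1"   # space-separated list of peers reported running
    if [ -n "$running_peers" ]; then
        echo "false"   # another instance runs: assume the prim is OK, do not bootstrap
    else
        echo "true"    # nothing else runs: safe to bootstrap a new prim
    fi
}
```

In the reported scenario this check sees n3 running and returns false, even though n3 never became a prim, so n1 is never allowed to bootstrap.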

Steps to reproduce were given in the Galera reliability testing paper https://goo.gl/VHyIIE. In brief: deploy a 5-node Galera cluster and run the given Jepsen cases to verify its self-healing capabilities.

It can also be reproduced on a Fuel env, although it is a rare corner case that is hard to catch, given that node-1, node-2, and node-3 are deployed as controller nodes:
1) https://github.com/bogdando/jepsen/tree/fuel/noop , see "How-to run tests from the Fuel master..."
2) PURGE=true ./vagrant_script/lein_test.sh noop ssh-test
3) docker exec -it jepsen bash -c "TESTPROC=mysqld lein test :only jepsen.noop-test/factors-netpart-test"

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I hope this is a rare corner case that occurs only when multiple network partitions apply in a row. Hence, a medium bug.

Changed in fuel:
importance: Undecided → Medium
milestone: none → 10.0
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
tags: added: galera
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

A short explanation:

- Node n1 was chosen by the others as the prim because it has the most recent GTID (per the OCF RA logic), so they wait for the prim to join it, as usual, in a start -> timed out -> stop loop.
- But the n1 prim can't be started, because n3 managed to sync via SST and then start successfully.
- So n1 tries to start in normal join mode instead (without --wsrep-new-cluster) and fails, since n3 is running but is not a prim.
- The result is a "deadlock" race condition that ends up with only 1/5 DB nodes available, and without the most recent GTID.

Workaround: kill mysqld on n3. This allows n1 to start as the prim, and the cluster eventually recovers to the most recent GTID, which n1 has.
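As a sketch of that workaround, assuming the operator applies it with a plain pkill over ssh (the report only says to kill mysqld at n3, not how), the helper below just builds the command to run. With no other instance left running in the partition, the re-election check passes and n1 bootstraps with --wsrep-new-cluster.

```shell
# Hypothetical helper for the manual workaround; the command form is
# an assumption, not taken from the report. It only prints the command,
# so the destructive step stays explicit and operator-driven.
workaround_cmd() {
    local stale_node="$1"   # the node holding the stale, non-prim mysqld
    echo "ssh ${stale_node} pkill -9 mysqld"
}
workaround_cmd node-3
```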

Dmitry Klenov (dklenov)
tags: added: area-library
Changed in fuel:
status: New → Confirmed