A Galera prim node can't be started by the OCF RA because another node is running and thinks the prim is OK

Bug #1595911 reported by Bogdan Dobrelya
Affects: Fuel for OpenStack
Status: Confirmed
Importance: Medium
Assigned to: Fuel Sustaining
Milestone: 10.0

Bug Description

If no prim Galera node is ready, any other resource instances that have managed to start by chance must fail while waiting for a new prim; otherwise, by design of the Galera OCF RA, the prim fails to start. The re-election checker reports false when it finds another resource instance running in the same Pacemaker partition that has quorum.

In this bug, a node waiting for a prim *was* kept started without a prim node ready, and the OCF RA mistakenly kept reporting that the prim is OK.
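The failing check can be illustrated with a simplified, hypothetical sketch (the function name and its input are illustrative, not the actual resource-agent code): before bootstrapping with --wsrep-new-cluster, the RA looks for any other instance already running in the quorate partition and refuses to bootstrap if one is found.

```shell
# Hypothetical sketch of the prim re-election guard, NOT the real
# Galera OCF RA code. It models the decision this bug is about:
# "is a re-election (bootstrap) needed, or does a prim already exist?"
reelection_needed() {
    local running_peers="$1"   # space-separated list of peers reported running
    if [ -n "$running_peers" ]; then
        echo "false"   # another instance runs: assume the prim is OK, do not bootstrap
    else
        echo "true"    # nothing else runs: safe to bootstrap a new prim
    fi
}
```

In the reported scenario this check sees n3 running and returns false, even though n3 never became a prim, so n1 is never allowed to bootstrap.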

Steps to reproduce were given in the Galera reliability testing paper https://goo.gl/VHyIIE. In brief: deploy a 5-node Galera cluster and run the given Jepsen cases to verify its self-healing capabilities.

It can also be reproduced on a Fuel env, although it is a rare corner case that is hard to catch, given that node-1, node-2, and node-3 are deployed as controller nodes:
1) https://github.com/bogdando/jepsen/tree/fuel/noop , see "How-to run tests from the Fuel master..."
2) PURGE=true ./vagrant_script/lein_test.sh noop ssh-test
3) docker exec -it jepsen bash -c "TESTPROC=mysqld lein test :only jepsen.noop-test/factors-netpart-test"

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I hope this is a rare corner case that occurs only when multiple network partitions apply in a row. Hence, a medium bug.

Changed in fuel:
importance: Undecided → Medium
milestone: none → 10.0
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
tags: added: galera
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

A short explanation:

- Node n1 was chosen by the others as the prim because it has the most recent GTID (per the OCF RA logic), so they wait for the prim to join it, as usual, in a start -> timed out -> stop loop.
- But the n1 prim can't be started, because n3 managed to sync via SST and then start successfully.
- So n1 tries to start in normal join mode instead (without --wsrep-new-cluster) and fails, since n3 is running but is not a prim.
- The result is a "deadlock" race condition that ends up with only 1/5 DB nodes available, and without the most recent GTID.

Workaround: kill mysqld on n3. This allows n1 to start as the prim, and the cluster eventually recovers to the most recent GTID, which n1 has.
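As a sketch of that workaround, assuming the operator applies it with a plain pkill over ssh (the report only says to kill mysqld at n3, not how), the helper below just builds the command to run. With no other instance left running in the partition, the re-election check passes and n1 bootstraps with --wsrep-new-cluster.

```shell
# Hypothetical helper for the manual workaround; the command form is
# an assumption, not taken from the report. It only prints the command,
# so the destructive step stays explicit and operator-driven.
workaround_cmd() {
    local stale_node="$1"   # the node holding the stale, non-prim mysqld
    echo "ssh ${stale_node} pkill -9 mysqld"
}
workaround_cmd node-3
```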

Dmitry Klenov (dklenov)
tags: added: area-library
Changed in fuel:
status: New → Confirmed