Comment 3 for bug 1595911

Bogdan Dobrelya (bogdando) wrote :

A short explanation:

- node n1 was chosen by the other nodes as the prim because it has the most recent GTID (per the OCF RA logic), so they wait for the prim to come up in order to join it, cycling as usual through a start -> timed out -> stop loop.
- but n1 can't be started as the prim, because n3 has managed to sync via SST and then started successfully.
- so n1 tries to start in normal join mode instead (w/o --wsrep-new-cluster) and fails, because n3 is running but is not a prim.
- the result is a "deadlock" race condition that leaves only 1/5 DB nodes available, and that node does not have the most recent GTID.
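
The sequence above can be modelled with a small sketch. It is only an illustration, not the OCF RA code: the `Node` class, the helper names, the cluster UUID and all seqno values are made up, and the only assumption taken from Galera is that a GTID has the form "<uuid>:<seqno>", so "most recent" means "highest seqno".

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    gtid: str        # Galera-style "<cluster uuid>:<seqno>" (values below are invented)
    running: bool    # True if mysqld is currently up on this node


def seqno(gtid: str) -> int:
    """Return the sequence number part of a "<uuid>:<seqno>" GTID."""
    return int(gtid.rsplit(":", 1)[1])


def elect_prim(nodes):
    """Pick the node with the highest seqno, i.e. the most recent GTID."""
    return max(nodes, key=lambda n: seqno(n.gtid))


def is_deadlocked(nodes):
    """True if the elected prim is down while some other node is already running.

    In that state the prim cannot bootstrap a new cluster, and joining the
    running node fails because that node is not a prim.
    """
    prim = elect_prim(nodes)
    others_running = [n for n in nodes if n.running and n.name != prim.name]
    return (not prim.running) and bool(others_running)


UUID = "00000000-0000-0000-0000-000000000001"  # made-up cluster UUID
cluster = [
    Node("n1", f"{UUID}:120", running=False),  # most recent GTID, cannot start
    Node("n2", f"{UUID}:118", running=False),
    Node("n3", f"{UUID}:115", running=True),   # synced via SST and started
    Node("n4", f"{UUID}:118", running=False),
    Node("n5", f"{UUID}:117", running=False),
]

print(elect_prim(cluster).name)  # the node the others wait for
print(is_deadlocked(cluster))    # whether the cluster is stuck as described
```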

w/a - kill mysqld on n3. This allows n1 to start as the prim, and the cluster eventually recovers to the most recent GTID, which n1 holds.
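
As a rough sketch of the ordering behind this w/a (again not the OCF RA implementation; `recovery_plan`, the action strings and the seqno values are hypothetical), the steps could be laid out like this:

```python
def recovery_plan(seqnos, running):
    """Return ordered (action, node) steps for the workaround.

    seqnos:  dict of node name -> GTID sequence number (illustrative values)
    running: set of node names where mysqld is currently up
    """
    # The intended prim is the node with the most recent GTID.
    prim = max(seqnos, key=seqnos.get)
    plan = []
    # 1. Kill mysqld on every running node that is not the intended prim.
    for name in sorted(running - {prim}):
        plan.append(("kill mysqld", name))
    # 2. Bootstrap the prim if it is not already up.
    if prim not in running:
        plan.append(("start with --wsrep-new-cluster", prim))
    # 3. Every other node then starts in normal join mode.
    for name in sorted(seqnos):
        if name != prim:
            plan.append(("start in join mode", name))
    return plan


# Made-up state matching the description: only n3 is up, n1 is the most recent.
seqnos = {"n1": 120, "n2": 118, "n3": 115, "n4": 118, "n5": 117}
plan = recovery_plan(seqnos, running={"n3"})
for action, node in plan:
    print(f"{node}: {action}")
```

In practice the cluster manager is expected to drive steps 2 and 3 by itself once n3 is stopped; the plan only makes the order explicit.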