MySQL OCF RA action monitor must check if a seed node is running the most recent of known GTIDs
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Fuel for OpenStack | Fix Committed | Medium | Bogdan Dobrelya | |
7.0.x | Fix Released | Critical | Denis Puchkin | |
8.0.x | Fix Released | Critical | Denis Puchkin | |
Mitaka | Fix Released | Critical | Sergii Golovatiuk | |
Bug Description
This bug is not easy to reproduce.
I caught it only a couple of times while running Jepsen tests for a few days.
Details:
When the seed (aka master) node was started a long time ago, and the OCF RA later reports "MySQL lost quorum or uninitialized" on the majority of the remaining DB nodes, the cluster either takes a *very* long time to auto-recover or fails to recover at all.
Only the seed node keeps running, even if its GTID is obsolete, i.e. not the most recent among the nodes. This requires manual recovery of the DB cluster nodes. For example, one may "nuke" all mysqld processes on the nodes and let the OCF RA pick the most recent node. This makes for a poor UX, although it should not be a big deal.
Example snippet (4 of 5 nodes were affected): http://
To fix that, the monitor action should perhaps check whether the current seed node (aka master) is running with a stale GTID, i.e. one that is not the most recent across the nodes, and report failure if so.
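The check proposed above could be sketched roughly as follows. This is only an illustration, not the actual OCF RA code: it assumes each node's GTID is available as a `uuid:seqno` string (Galera-style), and the function names `parse_gtid` and `seed_is_stale`, as well as the node names, are hypothetical. How the GTIDs are collected from the cluster is left out.

```python
from typing import Dict, Tuple


def parse_gtid(gtid: str) -> Tuple[str, int]:
    """Split an assumed 'uuid:seqno' GTID string into (uuid, seqno)."""
    uuid, _, seqno = gtid.strip().rpartition(":")
    if not uuid:
        raise ValueError(f"malformed GTID: {gtid!r}")
    return uuid, int(seqno)


def seed_is_stale(seed: str, gtids: Dict[str, str]) -> bool:
    """Return True if another node reports a newer seqno than the seed.

    Assumption: only nodes sharing the seed's cluster UUID are compared;
    nodes with a different UUID are ignored in this sketch.
    """
    seed_uuid, seed_seqno = parse_gtid(gtids[seed])
    for node, gtid in gtids.items():
        if node == seed:
            continue
        uuid, seqno = parse_gtid(gtid)
        if uuid == seed_uuid and seqno > seed_seqno:
            return True
    return False


# Hypothetical usage: node-1 is the seed but node-2 has a newer GTID,
# so a monitor built on this check would report failure for the seed.
known = {"node-1": "c1:10", "node-2": "c1:42", "node-3": "c1:7"}
if seed_is_stale("node-1", known):
    print("seed node-1 runs an obsolete GTID")
```

Ignoring nodes with a mismatched UUID is a simplification; a real monitor would have to decide how to treat them.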
tags: | added: area-library |
tags: | added: ct2 customer-found sla1 support |
tags: | added: on-verification |
Fix proposed to branch: master
Review: https://review.openstack.org/318162