Provide an action to recover from a majority failure

Bug #1842332 reported by Andrea Ieri
This bug affects 2 people
Affects: Etcd Charm
Status: Opinion
Importance: Medium
Assigned to: Justin Clark
Milestone: none

Bug Description

An HA etcd cluster can normally be scaled down to a single node by simply removing the extra units. However, if a majority of the units has to be force-removed, the relation-departed hooks never get a chance to run, and the surviving unit(s) will not accept new cluster members.
In order to recover from this situation, the etcd cluster has to be restarted once with the force-new-cluster option set to true. This should be wrapped in an action.
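As background for why the survivors are stuck: etcd only commits membership changes when a majority (quorum) of the currently registered members agrees. If the force-removed units are still registered, a lone survivor of a 3-member cluster is below quorum. The helper names below are illustrative only and are not part of the charm; this is a minimal sketch of the arithmetic.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members can still commit changes."""
    return healthy >= quorum(members)


# Three registered members, only etcd/0 still alive: no quorum.
print(has_quorum(members=3, healthy=1))
# After a forced new single-member cluster, the survivor is its own majority.
print(has_quorum(members=1, healthy=1))
```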

Example: let's assume we have a 3-node ETCD cluster where etcd/0 is functional, while etcd/1 and etcd/2 are unrecoverable. In order to bring the cluster back to health, an operator needs to do the following:

1. juju remove-unit --force etcd/1
2. juju remove-unit --force etcd/2
3. vim /var/snap/etcd/common/etcd.conf.yml # set 'force-new-cluster:' to true
4. service snap.etcd.etcd restart
5. vim /var/snap/etcd/common/etcd.conf.yml # set 'force-new-cluster:' to false
6. juju add-unit -n2 etcd

Steps 3 to 6 should be performed by an action.
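A rough sketch of what the on-unit part of such an action could look like, covering only the config edit and restart (steps 3 to 5). This is not the code from the linked PR: the function names are hypothetical, the config path and service name are copied from the steps above, and the sketch assumes `force-new-cluster` is a top-level key on its own line in the config file.

```python
import re
import subprocess
from pathlib import Path

# Path and service name as given in the manual steps.
CONF_PATH = Path("/var/snap/etcd/common/etcd.conf.yml")
SERVICE = "snap.etcd.etcd"

# Assumption: the key sits at the top level, at the start of a line.
_KEY = re.compile(r"^force-new-cluster:.*$", re.MULTILINE)


def set_force_new_cluster(conf_text: str, enabled: bool) -> str:
    """Return conf_text with force-new-cluster set to the given value."""
    line = "force-new-cluster: " + ("true" if enabled else "false")
    if _KEY.search(conf_text):
        return _KEY.sub(line, conf_text, count=1)
    # Key not present: append it, making sure it starts on its own line.
    sep = "" if conf_text.endswith("\n") or not conf_text else "\n"
    return conf_text + sep + line + "\n"


def recover(conf_path: Path = CONF_PATH, service: str = SERVICE) -> None:
    """Hypothetical action body: run steps 3 to 5 on the surviving unit."""
    conf_path.write_text(set_force_new_cluster(conf_path.read_text(), True))
    try:
        # Same restart command as step 4.
        subprocess.run(["service", service, "restart"], check=True)
    finally:
        # Always clear the flag so a later restart does not reset
        # cluster membership again.
        conf_path.write_text(
            set_force_new_cluster(conf_path.read_text(), False)
        )
```

Keeping the text edit in a pure function makes it easy to check without touching a real unit; `recover()` itself needs root on the etcd unit and is only meant to illustrate the ordering of the steps.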

George Kraft (cynerva)
Changed in charm-etcd:
importance: Undecided → Medium
status: New → Triaged
Changed in charm-etcd:
assignee: nobody → Justin Clark (justinclark)
status: Triaged → In Progress
Adam Dyess (addyess) wrote :

Adding a link to the PR which was started to address this
https://github.com/charmed-kubernetes/layer-etcd/pull/177

Changed in charm-etcd:
milestone: none → 1.28
Adam Dyess (addyess)
Changed in charm-etcd:
milestone: 1.28 → 1.28+ck1
Adam Dyess (addyess)
Changed in charm-etcd:
milestone: 1.28+ck1 → 1.29
Changed in charm-etcd:
milestone: 1.29 → 1.29+ck1
tags: added: backport-needed
Changed in charm-etcd:
milestone: 1.29+ck1 → 1.29+ck2
Adam Dyess (addyess)
Changed in charm-etcd:
milestone: 1.29+ck2 → 1.30
Adam Dyess (addyess) wrote :

There is an etcd rewrite into ops that could render this issue irrelevant.

Changed in charm-etcd:
milestone: 1.30 → none
status: In Progress → Opinion