2015-02-04 05:10:20 |
Stuart Bishop |
bug |
|
|
added bug |
2015-02-05 14:10:42 |
Curtis Hovey |
tags |
|
charms feature hooks |
|
2015-02-05 14:11:03 |
Curtis Hovey |
juju-core: status |
New |
Triaged |
|
2015-02-05 14:11:05 |
Curtis Hovey |
juju-core: importance |
Undecided |
Medium |
|
2015-02-05 14:22:11 |
Curtis Hovey |
tags |
charms feature hooks |
canonical-is charms feature hooks |
|
2015-02-12 17:33:09 |
Jorge Niedbalski |
tags |
canonical-is charms feature hooks |
canonical-is charms cts feature hooks |
|
2015-03-03 22:53:28 |
Curtis Hovey |
juju-core: milestone |
|
1.24-alpha1 |
|
2015-04-21 13:54:13 |
Curtis Hovey |
juju-core: milestone |
1.24-alpha1 |
|
|
2015-05-25 11:18:27 |
Stuart Bishop |
summary |
Impossible to cleanly remove a node from a cluster |
Impossible to cleanly remove a unit from a relation |
|
2015-05-25 11:47:48 |
Stuart Bishop |
description |
For charms needing to manage a clustered service (such as MongoDB, Cassandra, Redis, Swift) it is impossible to safely destroy a unit. The departing node on the doomed unit must be decommissioned to avoid potential data loss and manual repair of the cluster. Unfortunately, juju provides no suitable hooks to do this. The peer relation-departed and relation-broken hooks are supposed to support this use case, but do not.
The relation-departed hook cannot be used to decommission the departing node, because it is impossible to tell if the unit running the hook contains the doomed node or not. For example, in a 3 unit service (cassandra/0, cassandra/1, cassandra/2), if we drop unit 1 the following hooks are fired:
cassandra/0's peer relation-departed hook with $REMOTE_UNIT==cassandra/1
cassandra/1's peer relation-departed hook with $REMOTE_UNIT==cassandra/0
cassandra/1's peer relation-departed hook with $REMOTE_UNIT==cassandra/2
cassandra/2's peer relation-departed hook with $REMOTE_UNIT==cassandra/1
When any of these hooks are run, there is not enough context to tell if it is the local unit or $REMOTE_UNIT that is the unit being destroyed. The hooks cannot tell which node needs to be decommissioned and safely removed from the cluster.
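The ambiguity can be shown with a small simulation. This is only a sketch of the firing pattern listed above, not juju code: the function departed_hooks and the (local unit, $REMOTE_UNIT) tuple representation are hypothetical names introduced for illustration. It assumes, as in the example, that each remaining unit sees the doomed unit depart and the doomed unit sees each remaining peer depart.

```python
def departed_hooks(units, departing):
    """Return the (local_unit, remote_unit) pairs for the peer
    relation-departed hooks fired when `departing` is removed.

    Hypothetical model of the firing pattern in the example above.
    """
    events = []
    for local in units:
        for remote in units:
            if local == remote:
                continue
            # A hook fires on every pairing that involves the doomed unit.
            if departing in (local, remote):
                events.append((local, remote))
    return events


units = ["cassandra/0", "cassandra/1", "cassandra/2"]

# Hooks fired when cassandra/1 is dropped, and when cassandra/0 is dropped.
events_if_1_leaves = departed_hooks(units, "cassandra/1")
events_if_0_leaves = departed_hooks(units, "cassandra/0")

# Hook contexts that occur in both scenarios: a hook run with one of these
# (local unit, $REMOTE_UNIT) pairs cannot tell which unit is being destroyed.
ambiguous = set(events_if_1_leaves) & set(events_if_0_leaves)

for local, remote in sorted(ambiguous):
    print(f"{local} sees $REMOTE_UNIT=={remote} in both scenarios")
```

Under this model, the hook context seen by cassandra/0 with $REMOTE_UNIT==cassandra/1 is identical whether cassandra/0 or cassandra/1 is the unit being destroyed, which is why the hook cannot decide which node to decommission.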
The relation-broken hook cannot be used to decommission the departing node either, as it is run after the relation-departed hooks. The relation-departed hooks are responsible for revoking access from departing units, so the relation-broken hook cannot safely remove its node from the cluster as by this point the rest of the cluster is refusing to talk to it.
Without new features, I think charm authors are forced to use one of the following workarounds:
- Require the operator to manually decommission nodes before dropping a unit
- Require the operator to manually repair the cluster after dropping a unit
- Keep access open to departing units indefinitely and decommission the node in relation-broken, rather than have relation-departed keep the cluster secure.
For a fix, I think we require a new hook that is run on the departing unit before the relation-departed hooks are fired. Because relation-departed is the only point at which units can revoke access from the doomed unit, decommissioning must happen before then. If decommissioning is attempted in relation-departed or relation-broken, access rights will likely have already been removed by some or all of the remaining units. |
A relation-departed hook cannot be used by a charm to perform cleanup, as the remote service may have already run its relation-departed hook and revoked access. From the documentation, "this should be used to remove all references to the remote unit, because there's no guarantee that it's still part of the system".
The situation is worse for a peer relation. In addition to the above catch-22, the unit running the relation-departed hook has no idea if it is the unit leaving the service or if it is the remote unit leaving the service.
As a concrete example, it is impossible for the Cassandra charm to automatically decommission a node before it is removed. The peer relation-departed hook cannot decommission the node because the charm has no idea which unit is actually being dropped. Even if it did, the decommissioning process would fail: it takes time, and the other units in the cluster will have revoked the departing unit's access before it completes. Instead, the operator is required to manually decommission nodes before dropping the unit. Failing to do this requires lengthy cleanup operations, and data stored at replication factor 1 will be lost.
Before the relation-departed hooks are run, another hook needs to be run on the departing unit to provide it with the opportunity it needs. relation-departing seems the obvious choice. |
|
2015-08-06 13:26:50 |
Edward Hope-Morley |
tags |
canonical-is charms cts feature hooks |
canonical-is charms feature hooks sts |
|
2015-11-09 14:11:10 |
Mario Splivalo |
bug |
|
|
added subscriber Mario Splivalo |
2016-02-08 19:43:08 |
Jorge Niedbalski |
tags |
canonical-is charms feature hooks sts |
canonical-is charms feature hooks sts sts-needs-review |
|
2016-07-12 11:27:38 |
Stuart Bishop |
bug |
|
|
added subscriber The Canonical Sysadmins |
2016-08-11 14:32:02 |
Jorge Niedbalski |
tags |
canonical-is charms feature hooks sts sts-needs-review |
canonical-is charms feature hooks sts-rfe |
|
2016-08-11 15:22:28 |
Jorge Niedbalski |
tags |
canonical-is charms feature hooks sts-rfe |
canonical-is charms feature hooks sts sts-rfe |
|
2016-08-11 15:32:19 |
Jorge Niedbalski |
summary |
Impossible to cleanly remove a unit from a relation |
[RFE] Impossible to cleanly remove a unit from a relation |
|
2016-10-17 13:17:37 |
Anastasia |
juju-core: status |
Triaged |
Won't Fix |
|
2016-10-20 12:36:51 |
Anastasia |
bug task added |
|
juju |
|
2016-10-20 12:37:00 |
Anastasia |
juju: status |
New |
Triaged |
|
2016-10-20 12:37:05 |
Anastasia |
juju: importance |
Undecided |
Wishlist |
|
2017-03-27 06:07:57 |
Ian Booth |
juju: milestone |
|
2.2-beta3 |
|
2017-03-27 06:08:04 |
Ian Booth |
juju: importance |
Wishlist |
High |
|
2017-04-28 15:27:35 |
Canonical Juju QA Bot |
juju: milestone |
2.2-beta3 |
2.2-beta4 |
|
2017-05-11 18:22:59 |
Canonical Juju QA Bot |
juju: milestone |
2.2-beta4 |
2.2-rc1 |
|
2017-05-31 02:00:58 |
Tim Penhey |
juju: milestone |
2.2-rc1 |
|
|
2018-03-27 21:00:06 |
Dmitrii Shcherbakov |
bug |
|
|
added subscriber Dmitrii Shcherbakov |
2018-03-27 21:05:11 |
Dmitrii Shcherbakov |
bug watch added |
|
https://github.com/juju/docs/issues/2357 |
|
2019-11-27 08:46:06 |
Sandor Zeestraten |
bug |
|
|
added subscriber Sandor Zeestraten |
2020-03-24 15:22:11 |
Achilleas Anagnostopoulos |
juju: assignee |
|
Achilleas Anagnostopoulos (achilleasa) |
|
2020-03-24 15:22:18 |
Achilleas Anagnostopoulos |
juju: milestone |
|
2.8-beta1 |
|
2020-03-24 15:22:26 |
Achilleas Anagnostopoulos |
juju: status |
Triaged |
In Progress |
|
2020-03-30 15:24:14 |
Achilleas Anagnostopoulos |
juju: status |
In Progress |
Fix Committed |
|
2020-03-31 11:22:57 |
Dominique Poulain |
bug |
|
|
added subscriber Dominique Poulain |
2020-04-28 12:15:51 |
Vladimir Grevtsev |
bug |
|
|
added subscriber Vladimir Grevtsev |
2020-06-04 00:41:06 |
Harry Pidcock |
juju: status |
Fix Committed |
Fix Released |
|