[RFE] Impossible to cleanly remove a unit from a relation

Bug #1417874 reported by Stuart Bishop
This bug affects 10 people
Affects         Status        Importance  Assigned to                Milestone
Canonical Juju  Fix Released  High        Achilleas Anagnostopoulos  -
juju-core       Won't Fix     Medium      Unassigned                 -

Bug Description

A relation-departed hook cannot be used by a charm to perform cleanup, as the remote service may have already run its relation-departed hook and revoked access. From the documentation, "this should be used to remove all references to the remote unit, because there's no guarantee that it's still part of the system".

The situation is worse for a peer relation. In addition to the above catch-22, the unit running the relation-departed hook has no idea if it is the unit leaving the service or if it is the remote unit leaving the service.

So as a concrete example, it is impossible for the Cassandra charm to automatically decommission a node before it is removed. The peer-relation-departed hook cannot decommission the node because the charm has no idea which unit is actually being dropped. And even if it did, the decommissioning process would fail as it takes time and the other units in the cluster will have revoked its access before it completed. Instead, the operator is required to manually decommission nodes before dropping the unit. Failing to do this requires lengthy cleanup operations, and data stored at replication factor 1 will be lost.

Before the relation-departed hooks are run, another hook needs to be run on the departing unit to provide it with the opportunity it needs. relation-departing seems the obvious choice.
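For illustration, a minimal sketch of what such a hook could enable for Cassandra. The hook name is the proposed one, not an existing Juju hook, and the semantics are the assumed ones (it runs only on the departing unit, before any -relation-departed hooks fire elsewhere); nodetool decommission is Cassandra's standard decommissioning command:

    #!/usr/bin/env python3
    # Hypothetical hooks/cluster-relation-departing script, assuming the
    # proposed semantics: it runs only on the departing unit, before the
    # remaining nodes have revoked this unit's access.
    import subprocess

    # Stream this node's data to the rest of the cluster while the peers
    # still accept our connections - the step that cannot run safely today.
    subprocess.check_call(['nodetool', 'decommission'])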

Curtis Hovey (sinzui)
tags: added: charms feature hooks
Changed in juju-core:
status: New → Triaged
importance: Undecided → Medium
Curtis Hovey (sinzui)
tags: added: canonical-is
tags: added: cts
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → 1.24-alpha1
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.24-alpha1 → none
Stuart Bishop (stub)
summary: - Impossible to cleanly remove a node from a cluster
+ Impossible to cleanly remove a unit from a relation
Stuart Bishop (stub)
description: updated
tags: added: sts
removed: cts
Revision history for this message
Mario Splivalo (mariosplivalo) wrote : Re: Impossible to cleanly remove a unit from a relation

Hi.

This is also an issue for the percona-cluster charm, which can't safely remove a unit - it should shut down mysql prior to removing the unit, but it can't do so as there is no way to tell on which endpoint of the relation the -departed hook is running (if it's on the unit being removed then mysql should be stopped, but not on the other one!).

Introducing a hook that would fire before -departed, only on the unit that is about to depart the relation, would solve this issue.
(Here is the percona-cluster related bug: https://bugs.launchpad.net/charms/+source/percona-cluster/+bug/1514472)

Revision history for this message
Cheryl Jennings (cherylj) wrote :

We're tracking this request for consideration in future development cycles.

Copying some discussion from the mailing list to the bug (from axw):
Comment from the sidelines: we have something similar in storage now, with "storage-detaching" hook. This runs before storage is detached, so that charms can stop using the storage before it's ripped out beneath them.

With that in mind, can we please call this "-relation-departing"?

Revision history for this message
Stuart Bishop (stub) wrote :

Should a network partition between the controller and the departing node block the unit's departure until the -relation-departing hook can be run? 'Yes' would be a good answer if 'destroy-unit --force' will force the issue and skip running the -relation-departing hook if necessary.

tags: added: sts-needs-review
tags: added: sts-rfe
removed: sts sts-needs-review
tags: added: sts
summary: - Impossible to cleanly remove a unit from a relation
+ [RFE] Impossible to cleanly remove a unit from a relation
Changed in juju-core:
status: Triaged → Won't Fix
Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

Hello, Anastasia.

Is there another mechanism inside juju that would allow for cleanly removing a unit - something that would provide similar means to the desired '-relation-departing' hook?

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Mario Splivalo,
I have closed this bug as the "juju-core" project only tracks Juju 1.x. Our current Juju 1 release is 1.25 and it is only open for Critical bugs.

I will add this as a wishlist item to our Juju 2 launchpad project!

Thank you for your feedback!

Changed in juju:
status: New → Triaged
importance: Undecided → Wishlist
Revision history for this message
Gabriel Samfira (gabriel-samfira) wrote :

This has crept up in one of our use cases as well. Essentially, any charm that deploys a cluster will have this issue at some point.

Just adding my +1 to this being resolved.

Revision history for this message
Marco Ceppi (marcoceppi) wrote :

I think this could be solved by simply introducing a new environment variable to denote lifecycle. One possible solution is a `JUJU_UNIT_DYING` variable that is only set when the unit is on its way out. Another possibility is a `JUJU_UNIT_LIFECYCLE` environment variable with "alive" and "dying" values, leaving the opportunity to add more lifecycle labels later.
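For illustration only - neither variable exists in Juju, both are the proposal above - a -relation-departed hook could branch on the lifecycle signal like this:

    #!/usr/bin/env python3
    # Sketch of a -relation-departed hook using the proposed variables.
    # JUJU_UNIT_DYING and JUJU_UNIT_LIFECYCLE are hypothetical.
    import os

    dying = bool(os.environ.get('JUJU_UNIT_DYING')) \
        or os.environ.get('JUJU_UNIT_LIFECYCLE') == 'dying'

    if dying:
        # Local unit is on its way out: stop services, decommission, etc.
        pass
    else:
        # A remote unit is leaving: drop references to it and carry on.
        pass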

Revision history for this message
Stuart Bishop (stub) wrote :

An environment variable detailing the lifecycle would solve a related issue (the second paragraph in the original bug report), and would certainly help with, or outright solve, many situations.

However, this particular issue is that when a unit's -departed hooks are run, it may find that the related units have already departed the relation and cut off all access. A Cassandra node then has no opportunity to decommission itself cleanly, because it is no longer able to communicate with the rest of the cluster and migrate its data to the remaining nodes. It's particularly important if data is being stored without redundancy (replication factor == 1), because in that case the data is lost. Without the extra hook, removing a node from a cluster is the same as a failure and requires the cluster to be repaired. With the extra hook, the node may decommission itself cleanly: we don't need to repair the cluster, and we never have a period of reduced data redundancy.

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.2-beta3
importance: Wishlist → High
Revision history for this message
John A Meinel (jameinel) wrote :

this feels like bug #1417874

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@John,
It certainly is :D

Changed in juju:
milestone: 2.2-beta3 → 2.2-beta4
Changed in juju:
milestone: 2.2-beta4 → 2.2-rc1
Revision history for this message
Tim Penhey (thumper) wrote :

To be honest, I don't think synchronisation of hooks across units is something we are likely to add, and that is what is really needed for the clean removal of a clustered unit.

Instead, I think this would be much better suited to an action: one that explicitly removes a unit from the cluster, and that should be run and completed before the removal of the unit.
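As a sketch of that approach (the action name and script are assumptions, not anything a charm ships today), the charm could declare a decommission action in actions.yaml and implement it as:

    #!/usr/bin/env python3
    # Hypothetical actions/decommission script. The operator runs it and
    # waits for completion before 'juju remove-unit', so no hook
    # synchronisation across units is needed.
    import subprocess

    # Migrate data away while the peers still accept our connections.
    subprocess.check_call(['nodetool', 'decommission'])

    # action-set is a standard Juju hook tool in an action context; it
    # records a result the operator can inspect.
    subprocess.check_call(['action-set', 'outcome=decommissioned'])

The operator would then run something like 'juju run-action cassandra/0 decommission --wait' and only remove the unit once the action reports success.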

Changed in juju:
milestone: 2.2-rc1 → none
Revision history for this message
Mario Splivalo (mariosplivalo) wrote :

The problem would still remain: when a unit is being removed, juju fires the -departed hook, but from inside the hook there is no way to know whether it is running on the unit that's being removed or on the unit that's remaining.

Say you have two units of percona-cluster deployed. When you remove one of the units (say, percona-cluster/0), here is what happens:

1a. -relation-departed hook is run in percona-cluster/0
1b. -relation-departed hook is run in percona-cluster/1

2. -relation-broken hook is run in percona-cluster/0

Now, this creates the issue. When the -relation-departed hook is run on both units, there is no way for the hook code to know whether it's running on the unit that's parting or on the unit that's remaining. Therefore it can't politely stop the mysqld service. Currently, if you just 'juju remove-unit' to get rid of percona-cluster/0, the remaining unit's mysqld will switch to a 'degraded' state and will not allow querying data.

A "-relation-departing" hook would solve that issue (or the environment variable, or any other mechanism that would allow hook that is being run to know if it's run on departing unit or not).

(In this particular example with percona-cluster the workaround is quite simple - the operator would, prior to issuing 'juju remove-unit', ssh into the unit that is to be removed and manually issue a clean shutdown of the mysql service. After mysql stops there, the operator can use 'juju remove-unit' to get rid of the unit completely. However, having a separate hook would, imho, greatly simplify unit removal.)
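In CLI terms the workaround amounts to something like the following (the service name is an assumption and varies by series):

    juju ssh percona-cluster/0 'sudo service mysql stop'
    juju remove-unit percona-cluster/0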

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Hadn't noticed this bug. Here is the doc bug I created about similar matters: https://github.com/juju/docs/issues/2357

Changed in juju:
assignee: nobody → Achilleas Anagnostopoulos (achilleasa)
milestone: none → 2.8-beta1
status: Triaged → In Progress
Revision history for this message
Achilleas Anagnostopoulos (achilleasa) wrote :

PR https://github.com/juju/juju/pull/11356 targets the develop branch and exposes the departing unit when invoking xyz-relation-departed hooks via a new envvar called JUJU_DEPARTING_UNIT. Charms can compare this value to JUJU_REMOTE_UNIT to see if they are the ones going away.
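A minimal sketch of the check from inside a *-relation-departed hook (JUJU_UNIT_NAME and JUJU_REMOTE_UNIT are the standard hook environment variables; JUJU_DEPARTING_UNIT is the one added by the PR):

    #!/usr/bin/env python3
    import os

    departing = os.environ.get('JUJU_DEPARTING_UNIT')

    if departing == os.environ['JUJU_UNIT_NAME']:
        # This unit is the one leaving: decommission/stop services first.
        pass
    elif departing == os.environ.get('JUJU_REMOTE_UNIT'):
        # The remote unit is leaving: clean up references to it.
        pass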

Changed in juju:
status: In Progress → Fix Committed
Harry Pidcock (hpidcock)
Changed in juju:
status: Fix Committed → Fix Released