juju-core

should -broken -departed hooks run when a unit goes AWOL?

Bug #1494782 reported by Charles Butler on 2015-09-11

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
juju-core	Won't Fix	Medium	Unassigned

Bug Description

When deploying a cluster, if one or more units of that cluster go AWOL from Juju, it seems like we should do what we can to help that service avoid issues with config-routing.

To test I did the following:

juju deploy cs:~kubernetes/trusty/etcd
juju add-unit -n 2 etcd

Once the cluster settled, I went into the cloud provider terminal and terminated an instance. The state server received an EOF from the unit agent, so it was notified that the unit was entering a "down" state.

machine-0: 2015-09-11 14:29:38 WARNING juju.worker.instanceupdater updater.go:248 cannot get instance info for instance "i-87b33624": instances not found
machine-0: 2015-09-11 14:29:53 ERROR juju.worker runner.go:223 exited "instancepoller": machine 6 not found

However, the etcd configuration is now potentially broken (in reality it's not: etcd does Raft routing and reconfigures itself to no longer use that node).

However, in cases where we are determining leader/follower roles, this can be problematic, as the remaining units were not notified to reconfigure.

What I expected to see was the cluster-relation-departed and cluster-relation-broken hooks running on the remaining units in the cluster.

Aaron Bentley (abentley) on 2015-09-11

Changed in juju-core:
status:	New → Triaged
importance:	Undecided → Medium


William Reade (fwereade) wrote on 2015-09-17:

You *certainly* shouldn't see "-broken" hooks running -- those mean "this whole relation is gone forever, please delete any associated config".

Running "-departed" hooks is more controversial. Pyjuju did this, and IMO it's actively harmful; a management-level glitch (e.g. someone stops jujud on the remote unit) should really not cause the managed services to reconfigure themselves as though that remote unit were gone forever.

"-up" and "-down" hooks have been mooted, but I'm a bit worried about them because they can clearly only ever be advisory -- mgmt failure does not imply workload failure, and I don't want to cascade non-failures through the whole system; and, similarly, workload failures can still occur when juju is perfectly happy. And, by exposing them, we imply that you should pay attention to them, and I fear it all ends up much more complex for very little benefit.

(this does not apply to workload-status-induced up/down -- I think that's a good idea, with some caveats -- but triggering off *agent* status is risky because it's just *pretending* to solve a problem)
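The distinction drawn in the comment above can be summarised with a toy model. This is only an illustrative sketch, not Juju's implementation: the `Event` names and the `hooks_for` function are hypothetical, and the model is deliberately simplified to the hooks discussed in this thread.

```python
from enum import Enum, auto
from typing import List


class Event(Enum):
    """Hypothetical event names, invented for this illustration."""
    UNIT_REMOVED = auto()      # a remote unit leaves the relation for good
    RELATION_REMOVED = auto()  # the whole relation is gone forever
    AGENT_LOST = auto()        # the remote unit's agent stops responding


def hooks_for(event: Event, relation: str = "cluster") -> List[str]:
    """Return the hooks the remaining units run in this simplified model."""
    if event is Event.UNIT_REMOVED:
        # The remote unit is really gone, so peers should reconfigure.
        return [relation + "-relation-departed"]
    if event is Event.RELATION_REMOVED:
        # Simplified: only the -broken hook discussed in the thread is modelled.
        return [relation + "-relation-broken"]
    # AGENT_LOST: a management-level glitch does not imply workload failure,
    # so no hook runs and the peers keep their current configuration.
    return []


if __name__ == "__main__":
    for event in Event:
        print(event.name, hooks_for(event))
```

Under this model, terminating an instance out from under Juju, as in the test described in the report, corresponds to `AGENT_LOST`, which is why no hooks ran on the remaining etcd units.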

Anastasia (anastasia-macmood) on 2016-10-17

Changed in juju-core:
status:	Triaged → Won't Fix
