should *-broken / *-departed hooks run when a unit goes AWOL?

Bug #1494782 reported by Charles Butler
This bug affects 2 people
Affects: juju-core
Status: Won't Fix
Importance: Medium
Assigned to: Unassigned

Bug Description

When deploying a cluster and one or more of its units goes AWOL from Juju, it seems like we should be doing what we can to help that service avoid configuration/routing problems.

To test I did the following:

juju deploy cs:~kubernetes/trusty/etcd
juju add-unit -n 2 etcd

Once the cluster settled, I went into the cloud provider's console and terminated an instance. The state server received an EOF from the unit agent, so it was notified that the unit had entered a "down" state.

machine-0: 2015-09-11 14:29:38 WARNING juju.worker.instanceupdater updater.go:248 cannot get instance info for instance "i-87b33624": instances not found
machine-0: 2015-09-11 14:29:53 ERROR juju.worker runner.go:223 exited "instancepoller": machine 6 not found
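For reference, one way to watch for relation hook activity on the remaining units while reproducing this is to tail the consolidated log (a sketch; it assumes the peer relation is named "cluster" and that hook runs appear at the current logging level):

# Stream the environment's debug log and filter for peer relation hooks
# firing on the surviving units.
juju debug-log | grep -E 'cluster-relation-(departed|broken)'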

At this point the etcd configuration is potentially broken (in reality it's not: etcd does raft routing and reconfigures itself to no longer use that node).

However, in cases where the charm is determining leader/follower roles, this can be problematic, as the remaining units were never notified to reconfigure.

What I expected was to see the cluster-relation-departed and cluster-relation-broken hooks run on the remaining units in the cluster.
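For illustration, the sort of hook I'd expect to handle this on the remaining units looks roughly like the following (a sketch only; the relation name, config path, port and ETCD_* variable are assumptions, not the published charm's actual code):

#!/bin/bash
# hooks/cluster-relation-departed -- illustrative sketch only.
# Rebuilds the etcd peer list when a peer leaves the "cluster" relation.
set -e

departed="$JUJU_REMOTE_UNIT"    # the unit that is going away
juju-log "peer ${departed} departed; rebuilding peer list"

# Start the peer list with this unit itself...
peers="${JUJU_UNIT_NAME//\//-}=http://$(unit-get private-address):2380"

# ...then add every peer still present on the relation.
for unit in $(relation-list); do
    [ "$unit" = "$departed" ] && continue
    addr=$(relation-get private-address "$unit")
    peers="${peers},${unit//\//-}=http://${addr}:2380"
done

echo "ETCD_INITIAL_CLUSTER=\"${peers}\"" > /etc/default/etcd
service etcd restart

The point is not this particular reconfiguration (etcd manages its own raft membership), but that the remaining units currently get no hook at all when a peer's machine disappears.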

Aaron Bentley (abentley)
Changed in juju-core:
status: New → Triaged
importance: Undecided → Medium
William Reade (fwereade) wrote:

You *certainly* shouldn't see "-broken" hooks running -- those mean "this whole relation is gone forever, please delete any associated config".
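That is, -broken is where a charm deletes everything it generated for the relation, along the lines of this minimal sketch (the path is illustrative, not the actual charm's):

#!/bin/bash
# hooks/cluster-relation-broken -- illustrative sketch only.
# The relation itself is gone, so delete the config generated for it
# rather than trying to rewrite it.
set -e
rm -f /etc/default/etcd
service etcd restart || true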

Running "-departed" hooks is more controversial. Pyjuju did this, and IMO it's actively harmful; a management-level glitch (e.g. someone stops jujud on the remote unit) should really not cause the managed services to reconfigure themselves as though that remote unit were gone forever.

"-up" and "-down" hooks have been mooted, but I'm a bit worried about them because they can clearly only ever be advisory -- mgmt failure does not imply workload failure, and I don't want to cascade non-failures through the whole system; and, similarly, workload failures can still occur when juju is perfectly happy. And, by exposing them, we imply that you should pay attention to them, and I fear it all ends up much more complex for very little benefit.

(this does not apply to workload-status-induced up/down -- I think that's a good idea, with some caveats -- but triggering off *agent* status is risky because it's just *pretending* to solve a problem)

Changed in juju-core:
status: Triaged → Won't Fix