should *-broken *-departed hooks run when a unit goes AWOL?
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
juju-core | Won't Fix | Medium | Unassigned | | |
Bug Description
When deploying a cluster, if one or more units of that cluster go AWOL from Juju, it seems like we should be doing what we can to help that service avoid issues with config-routing.
To test I did the following:
juju deploy cs:~kubernetes/
juju add-unit -n 2 etcd
Once the cluster settled, I went into the cloud provider terminal and terminated an instance. The state server received an EOF from the unit agent, so it was notified that the unit was entering a "down" state.
machine-0: 2015-09-11 14:29:38 WARNING juju.worker.
machine-0: 2015-09-11 14:29:53 ERROR juju.worker runner.go:223 exited "instancepoller": machine 6 not found
However, the etcd configuration is now potentially broken. (In reality it's not: etcd does raft routing and reconfigures itself to no longer use that node.)
In cases where we are determining leader/follower roles, though, this can be problematic, because the remaining units were not notified to reconfigure.
What I expected to happen was to see the cluster-
Changed in juju-core: | |
status: | New → Triaged |
importance: | Undecided → Medium |
Changed in juju-core: | |
status: | Triaged → Won't Fix |
You *certainly* shouldn't see "-broken" hooks running -- those mean "this whole relation is gone forever, please delete any associated config".
Running "-departed" hooks is more controversial. Pyjuju did this, and IMO it's actively harmful; a management-level glitch (e.g. someone stops jujud on the remote unit) should really not cause the managed services to reconfigure themselves as though that remote unit were gone forever.
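To make the distinction concrete, here is a rough sketch of how a charm might treat the two hooks differently. It is a pure-Python simulation, not real charm code: the `handle_hook` function, the `peers` relation name, and the unit names are all made up for illustration, and only the `<relation>-relation-departed` / `<relation>-relation-broken` naming pattern comes from Juju itself.

```python
# Hypothetical sketch: handle_hook, the "peers" relation name, and the unit
# names are illustrative assumptions, not part of any real charm or Juju API.

def handle_hook(hook_name, peers, remote_unit=None):
    """Return the peer set a charm might keep after the given hook fires."""
    if hook_name.endswith("-relation-broken"):
        # The whole relation is gone forever: drop all config tied to it.
        return set()
    if hook_name.endswith("-relation-departed"):
        # One remote unit has left the relation: forget only that unit.
        return peers - {remote_unit}
    # Any other hook leaves the peer configuration untouched. The same is
    # true when no hook fires at all, e.g. when a remote agent goes silent.
    return set(peers)


peers = {"etcd/0", "etcd/1", "etcd/2"}
after_departed = handle_hook("peers-relation-departed", peers, "etcd/2")
after_broken = handle_hook("peers-relation-broken", peers)
after_other = handle_hook("config-changed", peers)
```

Under these assumptions, a "-departed" hook shrinks the peer set by one unit and a "-broken" hook empties it. That is why firing either one on a mere agent outage would make the remaining units act as though the loss were permanent.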
"-up" and "-down" hooks have been mooted, but I'm a bit worried about them because they can clearly only ever be advisory -- mgmt failure does not imply workload failure, and I don't want to cascade non-failures through the whole system; and, similarly, workload failures can still occur when juju is perfectly happy. And, by exposing them, we imply that you should pay attention to them, and I fear it all ends up much more complex for very little benefit.
(This does not apply to workload-status-induced up/down -- I think that's a good idea, with some caveats -- but triggering off *agent* status is risky because it's just *pretending* to solve a problem.)