Could we provide a guarantee that no unit of a given application will ever consider itself a leader until the previous leader has been deposed in the apiserver's view? Likewise, the apiserver should not grant any leader tokens until it receives confirmation that the previous leader has been deposed and has run the leader-deposed hook. The latter condition is a strong requirement: if there is a network partition and a unit agent is no longer reachable, the apiserver will never elect a new leader. Introducing a timeout there may result in a split-brain, unless a unit agent is required to stop executing further operations when it loses its connection to the apiserver.

We cannot simply stop hook execution, because a charm may inherently spawn threads and processes of its own accord, which may daemonize and do other arbitrary things on the system during hook execution. Process-tracking mechanisms are operating-system-specific (e.g. cgroups) and can be escaped, so we shouldn't even look that way.

The complicated part is that the unit <-> apiserver connection may be lost while the service-level network is fine, i.e. the loss of L1-relevant connectivity doesn't mean services on L2 have the same picture. This is the case where ToR and BoR switches provide the service and management networks respectively, over different physical media (switch fabrics). This is a common scenario for us (that's why we have network spaces). In other words: there may be an L1-related partition but no L2-related partition. I think that in this case a partitioned unit should run leader-deposed, which may run L2-related checks to see whether the problem is limited to unit <-> apiserver connectivity. This is an interesting scenario, because the unit agent is isolated and cannot get anything from the apiserver (it can't do facade RPC), but I think it is a useful one to model. As an operator, would you do something like that with your system? Probably yes: you would go out-of-band or in-person, check whether the problem impacts only Juju-related connectivity, and decide on the service-level impact. That is exactly what you should put in the charm's leader-deposed hook.

===

Now, as to having one leader unit per application running at a time: I believe this is, at least partially, already present in Juju.

https://github.com/juju/juju/blob/juju-2.3-rc1/worker/leadership/tracker.go#L206-L227

    // setMinion arranges for lease acquisition when there's an opportunity.
    func (t *Tracker) setMinion() error {
        ...
        t.claimLease = make(chan struct{})
        go func() {
            defer close(t.claimLease)
            logger.Debugf("%s waiting for %s leadership release", t.unitName, t.applicationName)
            err := t.claimer.BlockUntilLeadershipReleased(t.applicationName)
            if err != nil {
                logger.Debugf("error while %s waiting for %s leadership release: %v", t.unitName, t.applicationName, err)
            }

The only part I have not found yet is an explicit block on leader-deposed on the apiserver side. What I think we need (a sketch of step 4 follows this list):

1. the leadership-tracker tries to renew the lease;
2. the renewal fails because the token has expired;
3. the tracker runs the leader-deposed hook;
4. meanwhile, the apiserver doesn't allow anybody else to claim leadership until it gets an EXPLICIT notification from the former leader that it is done running leader-deposed.
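To make step 4 concrete, here is a minimal sketch of what such an apiserver-side gate could look like. None of this exists in Juju as written: GatedClaimer, Expire, Claim and AckDeposed are all hypothetical names, and the sketch only shows the shape of the protocol (lease expires -> claims blocked -> explicit depose acknowledgement -> claims unblocked):

    package lease

    import (
        "fmt"
        "sync"
    )

    // GatedClaimer refuses new leadership claims for an application until
    // the former leader has explicitly acknowledged that its leader-deposed
    // hook finished. Hypothetical; not Juju's actual API.
    type GatedClaimer struct {
        mu sync.Mutex
        // pendingDepose maps application name -> the unit that still has to
        // confirm its leader-deposed run.
        pendingDepose map[string]string
    }

    func NewGatedClaimer() *GatedClaimer {
        return &GatedClaimer{pendingDepose: make(map[string]string)}
    }

    // Expire records that unitName's lease on appName has lapsed (steps 1-2);
    // from now on all claims on appName are blocked until AckDeposed is called.
    func (c *GatedClaimer) Expire(appName, unitName string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.pendingDepose[appName] = unitName
    }

    // Claim grants leadership only while no depose confirmation is outstanding.
    func (c *GatedClaimer) Claim(appName, unitName string) error {
        c.mu.Lock()
        defer c.mu.Unlock()
        if former, ok := c.pendingDepose[appName]; ok {
            return fmt.Errorf("leadership of %q blocked: %s has not confirmed leader-deposed", appName, former)
        }
        // ... grant the lease to unitName here ...
        return nil
    }

    // AckDeposed is the EXPLICIT notification from the former leader (step 4)
    // that leader-deposed has completed; it unblocks new claims.
    func (c *GatedClaimer) AckDeposed(appName, unitName string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if c.pendingDepose[appName] == unitName {
            delete(c.pendingDepose, appName)
        }
    }

Note that this sketch inherits the availability problem described above: if the former leader is L1-partitioned, AckDeposed never arrives and the gate blocks new claims forever, which is exactly the liveness vs. split-brain trade-off.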
Do we have that in the codebase somewhere? The last code path I looked at is below; it doesn't seem to contain any check that says "the former leader has finished running leader-deposed": http://paste.ubuntu.com/25996658/

    func (manager *Manager) WaitUntilExpired(leaseName string) error {
        ...
        if err := manager.config.Secretary.CheckLease(leaseName); err != nil {

A Secretary just verifies basic things (state/leadership.go):

    // CheckLease is part of the lease.Secretary interface.
    func (leadershipSecretary) CheckLease(name string) error {
        if !names.IsValidApplication(name) {
            return errors.NewNotValid(nil, "not an application name")
        }
        return nil
    }

I haven't looked any further yet, so maybe I was not looking in the right place. Any ideas about the above?
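For comparison, if the missing condition were enforced at this level, the check might look roughly like the sketch below. The deposedConfirmed callback is entirely made up; it stands in for whatever state the apiserver would have to keep about outstanding leader-deposed runs. Only names.IsValidApplication and errors.NewNotValid are taken from the CheckLease snippet above:

    package leadership

    import (
        "github.com/juju/errors"
        "gopkg.in/juju/names.v2"
    )

    // checkClaimable validates the application name exactly as the real
    // CheckLease does, and additionally refuses the lease while the former
    // leader's leader-deposed run is unconfirmed. deposedConfirmed is a
    // hypothetical lookup into apiserver-side state.
    func checkClaimable(name string, deposedConfirmed func(appName string) bool) error {
        if !names.IsValidApplication(name) {
            return errors.NewNotValid(nil, "not an application name")
        }
        if !deposedConfirmed(name) {
            return errors.Errorf("lease for %q blocked: former leader has not confirmed leader-deposed", name)
        }
        return nil
    }

Whether such a check belongs in the Secretary (which today only validates names) or deeper in the lease manager is part of the same open question.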