Could we provide a guarantee that no unit of a given application will ever consider itself a leader until the previous leader has been deposed in the apiserver's view? Likewise, the apiserver should not grant any leader tokens until it receives confirmation that the previous leader has been deposed and has run the leader-deposed hook. The latter condition is a strong requirement: if there is a network partition and a unit agent is no longer reachable, the apiserver will never elect a new leader. Introducing a timeout there may result in a split-brain, unless a unit agent is required to stop executing further operations when it loses its connection to the apiserver.

We cannot simply stop hook execution, because a charm may inherently spawn threads and processes of its own accord, which may daemonize and do other arbitrary things on the system during hook execution. Process-tracking mechanisms are operating-system-specific (e.g. cgroups) and can be escaped, so we shouldn't even look that way.

The complicated part is that the unit <-> apiserver connection may be lost while the service-level network is fine, i.e. the loss of L1-relevant connectivity doesn't mean services on L2 have the same picture. This is the case where ToR and BoR switches provide the service and management networks respectively, over different physical media (switch fabrics). This is a common scenario for us (that's why we have network spaces). In other words: there may be an L1-related partition but no L2-related partition. I think that in this case a partitioned unit should run leader-deposed, which may run L2-related checks to see whether the problem is limited to unit <-> apiserver connectivity. This is an interesting scenario, because the unit agent is isolated and cannot get anything from the apiserver (it can't do facade RPC), but I think it is a useful one to model. As an operator, would you do something like that with your system? Probably yes: you would go out-of-band or in-person, check whether the problem impacts only Juju-related connectivity, and decide on the service-level impact. That is exactly what you should put in the charm's leader-deposed hook.

===

Now, as to having one leader unit per application running at a time: I believe this is, at least partially, already present in Juju.

https://github.com/juju/juju/blob/juju-2.3-rc1/worker/leadership/tracker.go#L206-L227

    // setMinion arranges for lease acquisition when there's an opportunity.
    func (t *Tracker) setMinion() error {
        ...
        t.claimLease = make(chan struct{})
        go func() {
            defer close(t.claimLease)
            logger.Debugf("%s waiting for %s leadership release", t.unitName, t.applicationName)
            err := t.claimer.BlockUntilLeadershipReleased(t.applicationName)
            if err != nil {
                logger.Debugf("error while %s waiting for %s leadership release: %v", t.unitName, t.applicationName, err)
            }

The only part I have not found yet is an explicit block on leader-deposed on the apiserver side. What I think we need (a sketch of step 4 follows this list):

1. the leadership-tracker tries to renew the lease;
2. the renewal fails because the token has expired;
3. the tracker runs the leader-deposed hook;
4. meanwhile, the apiserver doesn't allow anybody else to claim leadership until it gets an EXPLICIT notification from the former leader that it is done running leader-deposed.
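To make step 4 concrete, here is a minimal sketch of what such an apiserver-side gate could look like. None of this exists in Juju as written: GatedClaimer, Expire, Claim and AckDeposed are all hypothetical names, and the sketch only shows the shape of the protocol (lease expires -> claims blocked -> explicit depose acknowledgement -> claims unblocked):

    package lease

    import (
        "fmt"
        "sync"
    )

    // GatedClaimer refuses new leadership claims for an application until
    // the former leader has explicitly acknowledged that its leader-deposed
    // hook finished. Hypothetical; not Juju's actual API.
    type GatedClaimer struct {
        mu sync.Mutex
        // pendingDepose maps application name -> the unit that still has to
        // confirm its leader-deposed run.
        pendingDepose map[string]string
    }

    func NewGatedClaimer() *GatedClaimer {
        return &GatedClaimer{pendingDepose: make(map[string]string)}
    }

    // Expire records that unitName's lease on appName has lapsed (steps 1-2);
    // from now on all claims on appName are blocked until AckDeposed is called.
    func (c *GatedClaimer) Expire(appName, unitName string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.pendingDepose[appName] = unitName
    }

    // Claim grants leadership only while no depose confirmation is outstanding.
    func (c *GatedClaimer) Claim(appName, unitName string) error {
        c.mu.Lock()
        defer c.mu.Unlock()
        if former, ok := c.pendingDepose[appName]; ok {
            return fmt.Errorf("leadership of %q blocked: %s has not confirmed leader-deposed", appName, former)
        }
        // ... grant the lease to unitName here ...
        return nil
    }

    // AckDeposed is the EXPLICIT notification from the former leader (step 4)
    // that leader-deposed has completed; it unblocks new claims.
    func (c *GatedClaimer) AckDeposed(appName, unitName string) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if c.pendingDepose[appName] == unitName {
            delete(c.pendingDepose, appName)
        }
    }

Note that this sketch inherits the availability problem described above: if the former leader is L1-partitioned, AckDeposed never arrives and the gate blocks new claims forever, which is exactly the liveness vs. split-brain trade-off.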
Do we have that in the codebase somewhere? The last code path I looked at is below; it doesn't seem to contain any check that says "the former leader has finished running leader-deposed": http://paste.ubuntu.com/25996658/

    func (manager *Manager) WaitUntilExpired(leaseName string) error {
        ...
        if err := manager.config.Secretary.CheckLease(leaseName); err != nil {

A Secretary just verifies basic things (state/leadership.go):

    // CheckLease is part of the lease.Secretary interface.
    func (leadershipSecretary) CheckLease(name string) error {
        if !names.IsValidApplication(name) {
            return errors.NewNotValid(nil, "not an application name")
        }
        return nil
    }

I haven't looked any further yet, so maybe I was not looking in the right place. Any ideas about the above?
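For comparison, if the missing condition were enforced at this level, the check might look roughly like the sketch below. The deposedConfirmed callback is entirely made up; it stands in for whatever state the apiserver would have to keep about outstanding leader-deposed runs. Only names.IsValidApplication and errors.NewNotValid are taken from the CheckLease snippet above:

    package leadership

    import (
        "github.com/juju/errors"
        "gopkg.in/juju/names.v2"
    )

    // checkClaimable validates the application name exactly as the real
    // CheckLease does, and additionally refuses the lease while the former
    // leader's leader-deposed run is unconfirmed. deposedConfirmed is a
    // hypothetical lookup into apiserver-side state.
    func checkClaimable(name string, deposedConfirmed func(appName string) bool) error {
        if !names.IsValidApplication(name) {
            return errors.NewNotValid(nil, "not an application name")
        }
        if !deposedConfirmed(name) {
            return errors.Errorf("lease for %q blocked: former leader has not confirmed leader-deposed", name)
        }
        return nil
    }

Whether such a check belongs in the Secretary (which today only validates names) or deeper in the lease manager is part of the same open question.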