Juju 2.8 allows leadership code to run on non-leader

Bug #1898966 reported by Kenneth Koski
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Ian Booth
Milestone: 2.8.7

Bug Description

I have a charm that has an explicit leadership test:

https://github.com/juju-solutions/bundle-kubeflow/blob/master/charms/dex-auth/reactive/dex_auth.py#L45-L47
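
The check amounts to roughly the following (a minimal sketch, not the verbatim charm code; the reactive flag name here is illustrative only):

from charms.reactive import when
from charmhelpers.core import hookenv

@when('some.prerequisite.flag')  # illustrative flag name, not the charm's real one
def start_charm():
    # Guard: only the leader should build and set the pod spec.
    if not hookenv.is_leader():
        hookenv.log('this unit is not the leader')
        return False
    # ... the rest of the handler builds the spec and calls pod_spec_set()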

Nevertheless, I'm seeing a traceback related to attempting to set the pod spec while not a leader:

application-dex-auth: 22:48:54 ERROR juju.worker.uniter.context "dex-auth/0" is not the leader but is setting application k8s spec
application-dex-auth: 22:48:54 DEBUG jujuc running hook tool "juju-log"
application-dex-auth: 22:48:54 ERROR unit.dex-auth/0.juju-log oidc-client:4: pod-spec-set encountered an error: `ERROR this unit is not the leader`
application-dex-auth: 22:48:54 DEBUG jujuc running hook tool "juju-log"
application-dex-auth: 22:48:54 ERROR unit.dex-auth/0.juju-log oidc-client:4: Hook error:
Traceback (most recent call last):
  File "lib/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "lib/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "lib/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "lib/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-dex-auth-0/charm/reactive/dex_auth.py", line 120, in start_charm
    layer.caas_base.pod_spec_set(
  File "lib/charms/layer/caas_base.py", line 34, in pod_spec_set
    run_hook_command("pod-spec-set", spec)
  File "lib/charms/layer/caas_base.py", line 13, in run_hook_command
    run([cmd], stdout=PIPE, stderr=PIPE, check=True, input=stdin.encode('utf-8'))
  File "/usr/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['pod-spec-set']' returned non-zero exit status 1.

See this run for an example failure:

https://github.com/juju-solutions/bundle-kubeflow/runs/1223088751?check_suite_focus=true

Tags: k8s
Revision history for this message
Ian Booth (wallyworld) wrote :

Initial thought, could be wrong....

If the current pod got terminated and a new pod spun up, it could be that the leadership changed between checking and doing the pod-spec-set operation. The newly created unit would grab leadership and the (now terminated) unit would lose it. So I think the charm may need to deal with this.

tags: added: k8s
Changed in juju:
importance: Undecided → High
status: New → Triaged
milestone: none → 2.8.6
Revision history for this message
Kenneth Koski (knkski) wrote :

That seems likely. Is there any way Juju can handle this itself? As a charm author, it would be nice to just have some way of telling Juju "only the leader unit should respond to any hooks other than teardown, and don't change leadership while any unit is running hooks".

Revision history for this message
Ian Booth (wallyworld) wrote :

Leadership lease expiry is managed separately in the controller, so there will always be a window where a hook may start and leadership could be lost part way through. The controller has no visibility into what the unit agent is doing (they are separate components in a distributed system), so any such coordination is impractical. Now that we store agent state in the controller, though, maybe we can revisit that, but the complexity may not pay for itself.

Generally, so far as I understand it, charms are written to catch leadership exceptions and exit the hook gracefully.

Generally, all units need to be informed when something happens, e.g. to consume a relation data change, so not having the minions run hooks isn't really practical.
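
The "catch and exit gracefully" pattern ends up looking something like this (a sketch only; the helper name is made up, but pod_spec_set raising CalledProcessError matches the traceback above):

from subprocess import CalledProcessError
from charms import layer
from charmhelpers.core import hookenv

def set_pod_spec_if_possible(spec):
    # Leadership can be lost between an is-leader check and the
    # pod-spec-set call, so treat that failure as non-fatal instead of
    # letting the hook error out.
    try:
        layer.caas_base.pod_spec_set(spec)
    except CalledProcessError:
        hookenv.log('lost leadership while setting the pod spec; '
                    'leaving it to the new leader', level=hookenv.WARNING)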

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1898966] Re: Juju 2.8 allows leadership code to run on non-leader

As long as the unit agent is active, it would be extending the lease. Is the issue that the Juju controller is independently determining that the unit pod has gone away and revoking the lease before it has expired? Generally with charms we don't expect them to have to keep polling that they are still the leader, because that code ends up just being very ugly (every command is wrapped in a check).

It seems like the teardown of leases is being too aggressive (we didn't traditionally run into this because, on machines, agents can't run if the machine is torn down).

Given what most people have done with leadership and k8s charms, I'd rather have more of a "you only have 1 interesting agent and it is the leader". I definitely wouldn't want to interrupt a unit agent while it is running as the leader, because it has important work to get done.
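
To illustrate, the "every command is wrapped in a check" pattern looks something like this (hypothetical wrapper, sketch only):

import subprocess
from charmhelpers.core import hookenv

def leader_only(cmd, *args):
    # Re-check leadership immediately before every leader-only hook tool
    # call (leader-set, pod-spec-set, ...). It works, but every call site
    # has to go through a wrapper like this.
    if not hookenv.is_leader():
        hookenv.log('no longer the leader; skipping %s' % cmd,
                    level=hookenv.WARNING)
        return
    subprocess.run([cmd, *args], check=True)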

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

Hi Ken,
I just tested on a GKE cluster and can't reproduce it.
Here is the status of the deployment: https://pastebin.ubuntu.com/p/rBQsQCc5NM/
It was probably already fixed by recent commits.
I will close this one for now.
Please feel free to reopen it if you experience this again.
Thank you.

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
status: In Progress → Incomplete
John A Meinel (jameinel)
Changed in juju:
status: Incomplete → Fix Committed
Revision history for this message
Stuart Bishop (stub) wrote :

Charms cannot handle this sort of failure gracefully, because the hook environment tools such as pod-spec-set, relation-set and leader-set do not return specific error codes for "leader operation attempted on non-leader".
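
All a charm sees today is the generic non-zero exit status shown in the traceback above, so the best it can do is sniff the error text (sketch only, hypothetical helper):

from subprocess import run, PIPE, CalledProcessError

def pod_spec_set_tolerant(spec):
    # pod-spec-set exits 1 for any failure, so "not the leader" can only be
    # detected by matching the error text rather than a dedicated exit code.
    try:
        run(['pod-spec-set'], input=spec.encode('utf-8'),
            stdout=PIPE, stderr=PIPE, check=True)
    except CalledProcessError as e:
        if b'not the leader' in e.stderr:
            return False  # lost leadership; skip quietly
        raise             # some other error; let the hook fail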

Revision history for this message
Ian Booth (wallyworld) wrote :

I'm seeing some weird behaviour on a heavily loaded k8s node where a rolling update occurs. The update means that unit/0, which was the leader, is ultimately terminated, and a new unit/1, created for the new pod that appears, becomes the leader. In this case, the CPU load is causing timeouts when Juju tries to contact the new pod.

20:17:03 dex-auth charm deployed
20:17:25 juju.worker.uniter.charm deploying charm "local:kubernetes/dex-auth-0"
20:17:28 juju.worker.leadership dex-auth/0 making initial claim for dex-auth leadership
20:17:28 juju.worker.uniter unit "dex-auth/0" started
20:17:49 unit.dex-auth/0.juju-log Reactive main running for hook leader-elected

So far so good.
Then the dex-auth/1 unit appears because a new pod shows up:

20:29:45 DEBUG juju.worker.uniter starting uniter for "dex-auth/1"
20:29:45 DEBUG juju.worker.caasoperator start "dex-auth/1"

And dex-auth/1 is not the leader...

20:29:53 DEBUG jujuc running hook tool "is-leader"
20:29:53 DEBUG juju.worker.leadership dex-auth/1 is not dex-auth leader

But dex-auth/0 is the leader...

20:30:03 DEBUG jujuc running hook tool "is-leader"
20:30:03 INFO unit.dex-auth/0.juju-log oidc-client:7: unit -> dex-auth/0 True

The system is busy so Juju cannot get to the pod to talk to the init container agent

20:30:05 juju.worker.caasoperator request running status for "unit-dex-auth-1" dex-auth-6955f5f57f-m78sp
20:30:05 juju.worker.caasoperator remote init for "unit-dex-auth-1" {PodName:dex-auth-6955f5f57f-m78sp Initialising:false InitialisingTime:0001-01-01 00:00:00 +0000 UTC Running:false}
20:30:05 juju.worker.caasoperator retrying exec request, in 5 attempt, container not running not found
20:30:05 juju.worker.uniter resolver loop error: executing operation "remote init": attempt count exceeded: container not running not found

And at this time dex-auth/0 loses leadership because dex-auth/1 becomes leader

20:30:05 juju.worker.uniter.remotestate got leadership change for dex-auth/1: leader

While the connection error causes the init agent for dex-auth/1 to restart

20:30:05 juju.worker.caasoperator restarting "dex-auth/1" in 3s

But on restart, dex-auth/1 is not the leader anymore... why? The lease time for k8s units is 15 seconds, meaning 30 seconds is needed to revoke the lease.

20:30:08 juju.worker.caasoperator start "dex-auth/1"
20:30:08 juju.worker.caasoperator "dex-auth/1" started
20:30:08 juju.worker.leadership dex-auth/1 making initial claim for dex-auth leadership
20:30:08 juju.worker.leadership dex-auth leadership for dex-auth/1 denied

There are also now dex-auth/2 and dex-auth/3 in the works, due to more pod-spec-set calls causing rolling updates, but these aren't the leader either.
dex-auth/0 might be the leader, but the next time we hear from it, it is starting to shut down, as it was marked as dying since its pod was terminated:

20:30:57 unit.dex-auth/0.juju-log oidc-client:7: Reactive main running for hook oidc-client-relation-departed

We'll need to capture the lease.log file to see what happened with leadership.

Changed in juju:
status: Fix Committed → Triaged
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.8.6 → 2.8.7
John A Meinel (jameinel)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

The root cause is a bug in the operator facade: it returns a permission-denied error rather than a not-found error when an attempt is made to remove a non-existent unit. That permission-denied error was not handled properly in the operator worker, which caused the operator worker to shut down, and then all the uniters started to shut down. When the leader unit's leadership tracker worker is dying, it returns `false` for the `is-leader` check, which confuses both the SetPodSpec hook tool and the remote state watcher. So leadership was not actually revoked on the controller side.

Changed in juju:
assignee: Yang Kelvin Liu (kelvin.liu) → Ian Booth (wallyworld)
Revision history for this message
Ian Booth (wallyworld) wrote :

Here's a PR to fix the major issues:
https://github.com/juju/juju/pull/12161

The PR contains a bit more info. There were two root causes:
- Juju was "optimising" a unit destroy operation by prematurely removing the unit even though the unit agent was still running
- an incorrect permission check returned a permission error if the application agent attempted to access a unit that had been removed

The end result was that leadership could be taken away from a unit during a hook execution.

Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Kenneth Koski (knkski) wrote :

I got another failure related to this:

https://github.com/juju-solutions/bundle-kubeflow/runs/1392342134

Relevant log line:

application-dex-auth: 19:45:40 ERROR juju.worker.uniter.operation hook "oidc-client-relation-joined" (via explicit, bespoke hook script) failed: this unit is not the leader

Revision history for this message
Kenneth Koski (knkski) wrote :

juju debug log for recent failure

Revision history for this message
Kenneth Koski (knkski) wrote :

stdout for each kubeflow pod for recent failure

Changed in juju:
status: Fix Committed → Fix Released