Juju 2.8 allows leadership code to run on non-leader

Bug #1898966 reported by Kenneth Koski
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Ian Booth
Milestone: 2.8.7

Bug Description

I have a charm that has an explicit leadership test:

https://github.com/juju-solutions/bundle-kubeflow/blob/master/charms/dex-auth/reactive/dex_auth.py#L45-L47
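
The check amounts to roughly the following (a minimal sketch, not the verbatim charm code; the reactive flag name here is illustrative only):

from charms.reactive import when
from charmhelpers.core import hookenv

@when('some.prerequisite.flag')  # illustrative flag name, not the charm's real one
def start_charm():
    # Guard: only the leader should build and set the pod spec.
    if not hookenv.is_leader():
        hookenv.log('this unit is not the leader')
        return False
    # ... the rest of the handler builds the spec and calls pod_spec_set()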

Nevertheless, I'm seeing a traceback related to attempting to set the pod spec while not a leader:

application-dex-auth: 22:48:54 ERROR juju.worker.uniter.context "dex-auth/0" is not the leader but is setting application k8s spec
application-dex-auth: 22:48:54 DEBUG jujuc running hook tool "juju-log"
application-dex-auth: 22:48:54 ERROR unit.dex-auth/0.juju-log oidc-client:4: pod-spec-set encountered an error: `ERROR this unit is not the leader`
application-dex-auth: 22:48:54 DEBUG jujuc running hook tool "juju-log"
application-dex-auth: 22:48:54 ERROR unit.dex-auth/0.juju-log oidc-client:4: Hook error:
Traceback (most recent call last):
  File "lib/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "lib/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "lib/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "lib/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-dex-auth-0/charm/reactive/dex_auth.py", line 120, in start_charm
    layer.caas_base.pod_spec_set(
  File "lib/charms/layer/caas_base.py", line 34, in pod_spec_set
    run_hook_command("pod-spec-set", spec)
  File "lib/charms/layer/caas_base.py", line 13, in run_hook_command
    run([cmd], stdout=PIPE, stderr=PIPE, check=True, input=stdin.encode('utf-8'))
  File "/usr/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['pod-spec-set']' returned non-zero exit status 1.

See this run for an example failure:

https://github.com/juju-solutions/bundle-kubeflow/runs/1223088751?check_suite_focus=true

Tags: k8s
Revision history for this message
Ian Booth (wallyworld) wrote :

Initial thought, could be wrong....

If the current pod got terminated and a new pod spun up, it could be that the leadership changed between checking and doing the pod-spec-set operation. The newly created unit would grab leadership and the (now terminated) unit would lose it. So I think the charm may need to deal with this.

tags: added: k8s
Changed in juju:
importance: Undecided → High
status: New → Triaged
milestone: none → 2.8.6
Revision history for this message
Kenneth Koski (knkski) wrote :

That seems likely. Is there any way Juju can handle this itself? As a charm author, it would be nice to just have some way of telling Juju "only the leader unit should respond to any hooks other than teardown, and don't change leadership while any unit is running hooks".

Revision history for this message
Ian Booth (wallyworld) wrote :

Leadership lease expiry is managed separately in the controller, so there will always be a window where a hook may start and leadership could be lost part way through. The controller has no visibility into what the unit agent is doing (they are separate components in a distributed system), so any such coordination is impractical. Now that we store agent state in the controller, though, maybe we can revisit that, but the complexity may not pay for itself.

Generally, so far as I understand it, charms are written to catch leadership exceptions and exit the hook gracefully.

Generally, all units need to be informed when something happens, e.g. to consume a relation data change, so not having the minions run hooks isn't really practical.
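
The "catch and exit gracefully" pattern ends up looking something like this (a sketch only; the helper name is made up, but pod_spec_set raising CalledProcessError matches the traceback above):

from subprocess import CalledProcessError
from charms import layer
from charmhelpers.core import hookenv

def set_pod_spec_if_possible(spec):
    # Leadership can be lost between an is-leader check and the
    # pod-spec-set call, so treat that failure as non-fatal instead of
    # letting the hook error out.
    try:
        layer.caas_base.pod_spec_set(spec)
    except CalledProcessError:
        hookenv.log('lost leadership while setting the pod spec; '
                    'leaving it to the new leader', level=hookenv.WARNING)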

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1898966] Re: Juju 2.8 allows leadership code to run on non-leader

As long as the unit agent is active, it would be extending the lease. Is the issue that the Juju controller is independently determining that the unit pod has gone away and revoking the lease before it has expired? Generally with charms we don't expect them to have to keep polling that they are still the leader, because that code ends up just being very ugly (every command is wrapped in a check).

It seems like the teardown of leases is being too aggressive (we didn't traditionally run into this because, on machines, agents can't run if the machine is torn down).

Given what most people have done with leadership and k8s charms, I'd rather have more of a "you only have 1 interesting agent and it is the leader". I definitely wouldn't want to interrupt a unit agent while it is running as the leader, because it has important work to get done.
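
To illustrate, the "every command is wrapped in a check" pattern looks something like this (hypothetical wrapper, sketch only):

import subprocess
from charmhelpers.core import hookenv

def leader_only(cmd, *args):
    # Re-check leadership immediately before every leader-only hook tool
    # call (leader-set, pod-spec-set, ...). It works, but every call site
    # has to go through a wrapper like this.
    if not hookenv.is_leader():
        hookenv.log('no longer the leader; skipping %s' % cmd,
                    level=hookenv.WARNING)
        return
    subprocess.run([cmd, *args], check=True)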

Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

Hi Ken,
I just tested on a GKE cluster and can't reproduce it.
Here is the status of the deployment: https://pastebin.ubuntu.com/p/rBQsQCc5NM/
It was probably already fixed by recent commits.
I will close this one for now.
Please feel free to reopen it if you experience this again.
Thank you.

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
status: In Progress → Incomplete
John A Meinel (jameinel)
Changed in juju:
status: Incomplete → Fix Committed
Revision history for this message
Stuart Bishop (stub) wrote :

Charms cannot handle this sort of failure gracefully, because the hook environment tools such as pod-spec-set, relation-set and leader-set do not return specific error codes for "leader operation attempted on non-leader".
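
All a charm sees today is the generic non-zero exit status shown in the traceback above, so the best it can do is sniff the error text (sketch only, hypothetical helper):

from subprocess import run, PIPE, CalledProcessError

def pod_spec_set_tolerant(spec):
    # pod-spec-set exits 1 for any failure, so "not the leader" can only be
    # detected by matching the error text rather than a dedicated exit code.
    try:
        run(['pod-spec-set'], input=spec.encode('utf-8'),
            stdout=PIPE, stderr=PIPE, check=True)
    except CalledProcessError as e:
        if b'not the leader' in e.stderr:
            return False  # lost leadership; skip quietly
        raise             # some other error; let the hook fail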

Revision history for this message
Ian Booth (wallyworld) wrote :

I'm seeing some weird behaviour on a heavily loaded k8s node where a rolling update occurs. The update means that unit/0, which was the leader, is ultimately terminated, and a new unit/1, created for the new pod that appears, becomes the leader. In this case, the CPU load is causing timeouts when Juju tries to contact the new pod.

20:17:03 dex-auth charm deployed
20:17:25 juju.worker.uniter.charm deploying charm "local:kubernetes/dex-auth-0"
20:17:28 juju.worker.leadership dex-auth/0 making initial claim for dex-auth leadership
20:17:28 juju.worker.uniter unit "dex-auth/0" started
20:17:49 unit.dex-auth/0.juju-log Reactive main running for hook leader-elected

So far so good.
Then the dex-auth/1 unit appears because a new pod shows up:

20:29:45 DEBUG juju.worker.uniter starting uniter for "dex-auth/1"
20:29:45 DEBUG juju.worker.caasoperator start "dex-auth/1"

And dex-auth/1 is not the leader...

20:29:53 DEBUG jujuc running hook tool "is-leader"
20:29:53 DEBUG juju.worker.leadership dex-auth/1 is not dex-auth leader

But dex-auth/0 is the leader...

20:30:03 DEBUG jujuc running hook tool "is-leader"
20:30:03 INFO unit.dex-auth/0.juju-log oidc-client:7: unit -> dex-auth/0 True

The system is busy so Juju cannot get to the pod to talk to the init container agent

20:30:05 juju.worker.caasoperator request running status for "unit-dex-auth-1" dex-auth-6955f5f57f-m78sp
20:30:05 juju.worker.caasoperator remote init for "unit-dex-auth-1" {PodName:dex-auth-6955f5f57f-m78sp Initialising:false InitialisingTime:0001-01-01 00:00:00 +0000 UTC Running:false}
20:30:05 juju.worker.caasoperator retrying exec request, in 5 attempt, container not running not found
20:30:05 juju.worker.uniter resolver loop error: executing operation "remote init": attempt count exceeded: container not running not found

And at this time dex-auth/0 loses leadership because dex-auth/1 becomes leader

20:30:05 juju.worker.uniter.remotestate got leadership change for dex-auth/1: leader

While the connection error causes the init agent for dex-auth/1 to restart

20:30:05 juju.worker.caasoperator restarting "dex-auth/1" in 3s

But on restart, dex-auth/1 is not the leader anymore... why? The lease time for k8s units is 15 seconds, meaning 30 seconds is needed to revoke the lease.

20:30:08 juju.worker.caasoperator start "dex-auth/1"
20:30:08 juju.worker.caasoperator "dex-auth/1" started
20:30:08 juju.worker.leadership dex-auth/1 making initial claim for dex-auth leadership
20:30:08 juju.worker.leadership dex-auth leadership for dex-auth/1 denied

There are also now dex-auth/2 and dex-auth/3 in the works, due to more pod-spec-set calls causing rolling updates, but these aren't the leader either.
dex-auth/0 might be the leader, but the next time we hear from it, it is starting to shut down, as it was marked as dying since its pod was terminated:

20:30:57 unit.dex-auth/0.juju-log oidc-client:7: Reactive main running for hook oidc-client-relation-departed

We'll need to capture the lease.log file to see what happened with leadership.

Changed in juju:
status: Fix Committed → Triaged
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.8.6 → 2.8.7
John A Meinel (jameinel)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

The root cause is a bug in the operator facade: it returns a permission-denied error rather than a not-found error when an attempt is made to remove a non-existent unit. That permission-denied error was not handled properly in the operator worker, which caused the operator worker to shut down, and then all the uniters started to shut down. When the leader unit's leadership tracker worker is dying, it returns `false` for the `is-leader` check, which confuses both the SetPodSpec hook tool and the remote state watcher. So leadership was not actually revoked on the controller side.

Changed in juju:
assignee: Yang Kelvin Liu (kelvin.liu) → Ian Booth (wallyworld)
Revision history for this message
Ian Booth (wallyworld) wrote :

Here's a PR to fix the major issues:
https://github.com/juju/juju/pull/12161

The PR contains a bit more info. There were two root causes:
- Juju was "optimising" a unit destroy operation by prematurely removing the unit even though the unit agent was still running
- an incorrect permission check returned a permission error if the application agent attempted to access a unit that had been removed

The end result was that leadership could be taken away from a unit during a hook execution.

Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Kenneth Koski (knkski) wrote :

I got another failure related to this:

https://github.com/juju-solutions/bundle-kubeflow/runs/1392342134

Relevant log line:

application-dex-auth: 19:45:40 ERROR juju.worker.uniter.operation hook "oidc-client-relation-joined" (via explicit, bespoke hook script) failed: this unit is not the leader

Revision history for this message
Kenneth Koski (knkski) wrote :

juju debug log for recent failure

Revision history for this message
Kenneth Koski (knkski) wrote :

stdout for each kubeflow pod for recent failure

Changed in juju:
status: Fix Committed → Fix Released