Comment 7 for bug 1898966

Ian Booth (wallyworld) wrote :

I'm seeing some weird behaviour on a heavily loaded k8s node where a rolling update occurs. The rolling update ultimately terminates unit/0, which was the leader, and creates unit/1 for the new pod, which then becomes the leader. In this case, the CPU load is causing timeouts when Juju tries to contact the new pod.

20:17:03 dex-auth charm deployed
20:17:25 juju.worker.uniter.charm deploying charm "local:kubernetes/dex-auth-0"
20:17:28 juju.worker.leadership dex-auth/0 making initial claim for dex-auth leadership
20:17:28 juju.worker.uniter unit "dex-auth/0" started
20:17:49 unit.dex-auth/0.juju-log Reactive main running for hook leader-elected

So far so good.
Then the dex-auth/1 unit appears because a new pod shows up

20:29:45 DEBUG juju.worker.uniter starting uniter for "dex-auth/1"
20:29:45 DEBUG juju.worker.caasoperator start "dex-auth/1"

And dex-auth/1 is not the leader...

20:29:53 DEBUG jujuc running hook tool "is-leader"
20:29:53 DEBUG juju.worker.leadership dex-auth/1 is not dex-auth leader

But dex-auth/0 is the leader...

20:30:03 DEBUG jujuc running hook tool "is-leader"
20:30:03 INFO unit.dex-auth/0.juju-log oidc-client:7: unit -> dex-auth/0 True

The system is busy, so Juju cannot reach the pod to talk to the init container agent

20:30:05 juju.worker.caasoperator request running status for "unit-dex-auth-1" dex-auth-6955f5f57f-m78sp
20:30:05 juju.worker.caasoperator remote init for "unit-dex-auth-1" {PodName:dex-auth-6955f5f57f-m78sp Initialising:false InitialisingTime:0001-01-01 00:00:00 +0000 UTC Running:false}
20:30:05 juju.worker.caasoperator retrying exec request, in 5 attempt, container not running not found
20:30:05 juju.worker.uniter resolver loop error: executing operation "remote init": attempt count exceeded: container not running not found
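
For context, here is a minimal sketch (in Go, and not Juju's actual caasoperator code) of the bounded-retry behaviour those log lines suggest: the remote init exec request is retried a fixed number of times and, once the attempt count is exceeded, the last error is surfaced to the uniter's resolver loop. The function and variable names are hypothetical.

package main

import (
	"errors"
	"fmt"
	"time"
)

// Stand-in for the "container not running not found" condition in the log.
var errContainerNotRunning = errors.New("container not running not found")

// remoteInitWithRetry retries fn up to maxAttempts times, pausing between
// attempts, and wraps the final error once the attempt count is exceeded.
func remoteInitWithRetry(fn func() error, maxAttempts int, delay time.Duration) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		fmt.Printf("retrying exec request, in %d attempt, %v\n", attempt, err)
		time.Sleep(delay)
	}
	return fmt.Errorf("attempt count exceeded: %w", err)
}

func main() {
	// On a heavily loaded node the init container never reports as running,
	// so every attempt fails and the operation errors out.
	err := remoteInitWithRetry(func() error { return errContainerNotRunning }, 5, time.Second)
	fmt.Println(`resolver loop error: executing operation "remote init":`, err)
}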

And at this time dex-auth/0 loses leadership because dex-auth/1 becomes leader

20:30:05 juju.worker.uniter.remotestate got leadership change for dex-auth/1: leader

Meanwhile, the connection error causes the init agent for dex-auth/1 to restart

20:30:05 juju.worker.caasoperator restarting "dex-auth/1" in 3s

But on restart, dex-auth/1 is not the leader anymore... why? The lease time for k8s units is 15 seconds, meaning 30 seconds is needed to revoke the lease.

20:30:08 juju.worker.caasoperator start "dex-auth/1"
20:30:08 juju.worker.caasoperator "dex-auth/1" started
20:30:08 juju.worker.leadership dex-auth/1 making initial claim for dex-auth leadership
20:30:08 juju.worker.leadership dex-auth leadership for dex-auth/1 denied
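
To make the timing concrete, here's a rough sketch (an illustration under the assumptions above, not the lease manager's actual code): if the lease duration is 15 seconds and a lease can only be revoked after going unrenewed for two durations (30 seconds), then a claim reported at 20:30:05 should still be held when the agent restarts 3 seconds later at 20:30:08.

package main

import (
	"fmt"
	"time"
)

const (
	leaseDuration = 15 * time.Second
	// Assumed rule from above: a lease is revocable only after two full
	// durations pass without renewal.
	revocableAfter = 2 * leaseDuration
)

func main() {
	// 20:30:05 -> leadership change reported for dex-auth/1
	// 20:30:08 -> dex-auth/1 restarts and its leadership claim is denied
	elapsed := 3 * time.Second

	fmt.Printf("elapsed since claim: %v; revocable after: %v\n", elapsed, revocableAfter)
	fmt.Println("lease should still be held:", elapsed < revocableAfter)
}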

There are also now dex-auth/2 and dex-auth/3 in the works due to more set-pod-spec calls causing rolling updates, but these aren't the leader either.
dex-auth/0 might be the leader, but the next we hear from it, it is starting to shut down, as it was marked as dying when its pod was terminated

20:30:57 unit.dex-auth/0.juju-log oidc-client:7: Reactive main running for hook oidc-client-relation-departed

We'll need to capture the lease.log file to see what happened with leadership.