Juju 2.8 silent hook failure

Bug #1898792 reported by Kenneth Koski
Affects         Status        Importance  Assigned to      Milestone
Canonical Juju  Fix Released  High        Yang Kelvin Liu  2.8.6

Bug Description

I have 2 charms with a relation between them, dex-auth and oidc-gatekeeper. Occasionally, the relation-joined hook fails without useful debugging information in the logs. See here for an example failure:

https://github.com/juju-solutions/bundle-kubeflow/pull/225/checks?check_run_id=1217310448

The relevant line is under `Debug failures`:

dex-auth/2 error idle 10.1.8.52 5556/TCP hook failed: "oidc-client-relation-joined"

I'm actually unable to find a failure in the logs for `oidc-client-relation-joined`, but I did find this failure:

2020-10-06T21:36:22.953939641Z 2020-10-06 21:36:22 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] error: hook failed: "start"

There is no indication of what might have actually failed. It's possibly related to this error a few lines above:

2020-10-06T21:36:19.246284820Z 2020-10-06 21:36:19 ERROR juju.worker.caasoperator runner.go:442 exited "dex-auth/1": executing operation "remote init": caas-unit-init for unit "dex-auth/2" with command: "/var/lib/juju/tools/jujud caas-unit-init --unit unit-dex-auth-2 --charm-dir /tmp/unit-dex-auth-2455037127/charm --send --operator-file /tmp/unit-dex-auth-2455037127/operator-client-cache.yaml --operator-ca-cert-file /tmp/unit-dex-auth-2455037127/ca.crt" failed: : Internal error occurred: error executing command in container: failed to exec in container: failed to create exec "bfab0180264e3b46f4e60bf275663f3f0935265e938e5e3ac8354fb5f6be6d08": cannot exec in a stopped state: unknown

But it's hard to tell exactly what's going on.
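
(As a rough sketch of where more detail might live, the application operator's pod log can be read directly with kubectl; the `kubeflow` namespace and the `<app>-operator-0` pod naming below are assumptions about this deployment.)

    # Assumed naming: Juju 2.x application operators usually run as <application>-operator-0
    kubectl -n kubeflow logs dex-auth-operator-0

    # Replay the debug log for just the failing unit (tag form of the unit name)
    juju debug-log --replay --include unit-dex-auth-2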

Tags: k8s
Kenneth Koski (knkski)
summary: - Juju 2.8 silent relation failure
+ Juju 2.8 silent hook failure
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1898792] [NEW] Juju 2.8 silent hook failure

That sounds like the charm is trying to run a command in the application pod, but the application pod is currently "stopped", so you cannot exec into it.
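
One way to check that (just a sketch; `<dex-auth-pod>` is a placeholder for the workload pod name) is to ask Kubernetes what state the pod's containers are actually in:

    # <dex-auth-pod> is a placeholder; find the real name with: kubectl get pods -n kubeflow
    kubectl -n kubeflow get pod <dex-auth-pod> \
      -o jsonpath='{.status.containerStatuses[*].state}'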

Revision history for this message
Kenneth Koski (knkski) wrote :

The interesting thing is that it looks like the pod in question exists:

dex-auth/0*  terminated  idle    10.1.80.20  5556/TCP  unit stopped by the cloud
dex-auth/1   terminated  failed  10.1.80.51  5556/TCP  unit stopped by the cloud
dex-auth/2   error       idle    10.1.80.52  5556/TCP  hook failed: "start"

Here's what kubectl sees:

NAMESPACE  NAME                       READY  STATUS       RESTARTS  AGE
kubeflow   dex-auth-6b568d4ddd-vh2qf  1/1    Running      0         37s
kubeflow   dex-auth-844c5ffdb-bm4gv   0/1    Terminating  1         9m57s

I assume that `dex-auth-6b568d4ddd-vh2qf` == `dex-auth/2`, which doesn't explain why `dex-auth/2` has a failing start hook. Perhaps the error is being attributed to the wrong unit?

Revision history for this message
Ian Booth (wallyworld) wrote :

Juju records the k8s UID of the pod as the provider-id attribute, which you can see in `juju status --format yaml`, so you can match up the pod and the unit.
(For statefulset pods it is the pod name rather than the UID.)
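
For example, a minimal sketch of matching them up (the `kubeflow` namespace and the `juju-app` pod label here are assumptions based on this deployment):

    # Provider IDs (pod UIDs, for deployment-backed applications) as Juju records them
    juju status dex-auth --format yaml | grep provider-id

    # Pod names and UIDs as Kubernetes reports them (label and namespace assumed)
    kubectl get pods -n kubeflow -l juju-app=dex-auth \
      -o custom-columns=NAME:.metadata.name,UID:.metadata.uid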

tags: added: k8s
Revision history for this message
Kenneth Koski (knkski) wrote :

I'm not sure if this is the same issue or not, but I also noticed that this error in `juju debug-log`:

    application-dex-auth: 22:52:59 DEBUG juju.worker.uniter [AGENT-STATUS] failed: resolver loop error

is actually reflected in the output of `juju status --format yaml`:

       dex-auth/1:
        workload-status:
          current: terminated
          message: unit stopped by the cloud
          since: 07 Oct 2020 22:52:28Z
        juju-status:
          current: failed
          message: resolver loop error
          since: 07 Oct 2020 22:52:47Z
        open-ports:
        - 5556/TCP
        address: 10.1.31.51
        provider-id: 7c878fae-de9d-43c5-96bb-a0e87a63bd3d

But not regular `juju status`:

    dex-auth/1 terminated failed 10.1.31.51 5556/TCP unit stopped by the cloud

There's no additional information in `juju debug-log`, and it's also not clear from the logs which unit is experiencing the `resolver loop error`, which seems suboptimal.

Changed in juju:
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
importance: Undecided → High
status: New → Triaged
milestone: none → 2.8.6
Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

I think the root cause is that the uniter worker map inside the runner in the operator (conceptually `map[uniterID1]uniterWorker(uniterID2)`) sometimes ends up with `uniterID1 != uniterID2`, especially when the application is scaled to a large number of pods or a uniter is restarted after an error.
https://github.com/juju/juju/pull/12120 will land in 2.8 to fix this issue.

Changed in juju:
status: Triaged → In Progress
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released