Continuous rotation of active unit in K8s charm

Bug #1892791 reported by Kenneth Koski
Affects:      Canonical Juju
Status:       Invalid
Importance:   High
Assigned to:  Thomas Miller
Milestone:    none

Bug Description

I am deploying two charms, dex-auth and oidc-gatekeeper. The deployment works fine in Juju 2.7, but in Juju 2.8 it runs into a condition where new units are continuously rotated in, preventing the workload from ever running. Steps to reproduce:

juju add-model kubeflow
juju deploy cs:~kubeflow-charmers/dex-auth-53
juju deploy cs:~kubeflow-charmers/oidc-gatekeeper-53
juju relate dex-auth oidc-gatekeeper
juju config oidc-gatekeeper client-secret=password
juju config dex-auth static-username=admin static-password=password
juju wait -wv
juju config dex-auth public-url=localhost
juju config oidc-gatekeeper public-url=localhost

The steps above work fine on Juju 2.8 until the last two, where `public-url` is configured for both charms. That's where Juju 2.8 starts constantly swapping in new units.

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.8.3
status: New → Triaged
importance: Undecided → High
Ian Booth (wallyworld) wrote:

The k8s deployment controller for dex-auth keeps toggling between 0 and 1. For Deployments, each time a new pod is created, that results in a new Juju unit, and hence a new leader. So the next step is to figure out what's causing the Deployment to kill and restart pods.

In the Juju logs, there is a hook error logged twice:

application-dex-auth: 11:58:50 ERROR unit.dex-auth/0.juju-log oidc-client:0: Hook error:
Traceback (most recent call last):
  File "lib/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "lib/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "lib/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "lib/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-dex-auth-0/charm/reactive/dex_auth.py", line 60, in start_charm
    oidc_client_info = endpoint_from_name('oidc-client').get_config()
  File "/var/lib/juju/agents/unit-dex-auth-0/charm/hooks/relations/oidc-client/requires.py", line 13, in get_config
    return [
  File "/var/lib/juju/agents/unit-dex-auth-0/charm/hooks/relations/oidc-client/requires.py", line 14, in <listcomp>
    json.loads(unit.received_raw["client_info"])
  File "/usr/lib/python3.8/json/__init__.py", line 341, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
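
The traceback shows json.loads being handed None because the remote unit has not yet published client_info on the relation. A minimal sketch of a more defensive get_config, assuming the charms.reactive Endpoint interface used above (the guard is illustrative, not the charm's actual fix):

    import json

    def get_config(self):
        # Skip units that have not published client_info yet, rather than
        # passing None to json.loads and raising the TypeError above.
        return [
            json.loads(unit.received_raw["client_info"])
            for unit in self.all_joined_units
            if unit.received_raw.get("client_info") is not None
        ]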

Ian Booth (wallyworld) wrote:

The issue is that the pod spec is not stable: the container name contains a hash of the config. When a charm sends in a new pod spec and Juju updates the k8s Deployment, that triggers a rolling update in which k8s creates a new pod and terminates the old one. Juju reacts by deleting the existing unit and adding a new one for the new k8s pod, and that new unit becomes leader. This then results in a new pod spec being sent to Juju, the k8s Deployment is updated again because the pod spec has changed (it embeds the config hash), and rinse and repeat.

I stopped the flapping by making the container name stable (removing the config hash). I don't fully understand the comment in the charm code explaining why the hash is needed, but we need to find a way around it to solve the issue.

Note that the oidc-gatekeeper charm does not have the same issue - the config change triggers an update, and Juju creates a new oidc-gatekeeper/1 unit which becomes leader, and that's how it remains. Making the dex-auth pod spec stable worked the same way.
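
To make the instability concrete, here is a rough sketch of the difference (the names and config keys are made up for illustration, not taken from the charm):

    import hashlib
    import json

    config = {"static-username": "admin", "public-url": "localhost"}

    # Embedding a hash of the config in the container name means any change
    # to the hash input produces a "new" pod spec, and the Deployment rolls.
    config_hash = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:8]
    unstable_name = "dex-auth-{}".format(config_hash)

    # A fixed name keeps the pod spec stable across hook invocations.
    stable_name = "dex-auth"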

Kenneth Koski (knkski) wrote:

Thanks for looking into this. The hash is required because we need the workload pod to get restarted, so that it will read an updated ConfigMap. Otherwise, the ConfigMap gets updated, which updates a configuration file in the pod's filesystem, but the dex service never re-reads the configuration file.

Ian Booth (wallyworld) wrote:

Can't the workload create a file watcher which reacts to file changes?
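
Something along these lines, as a rough sketch only (polling mtime in Python; a real workload would more likely use inotify/fsnotify, and the path here is hypothetical):

    import os
    import time

    CONFIG_PATH = "/etc/dex/config.yaml"  # hypothetical ConfigMap mount point

    def watch_and_reload(reload_config, interval=2.0):
        # Poll the mounted config file and trigger a reload when it changes.
        last_mtime = os.stat(CONFIG_PATH).st_mtime
        while True:
            time.sleep(interval)
            mtime = os.stat(CONFIG_PATH).st_mtime
            if mtime != last_mtime:
                last_mtime = mtime
                reload_config()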

John A Meinel (jameinel) wrote:

I believe it *could*, but if that isn't how the application container was written, then it doesn't actually do that.
It seems odd for the application to be written to consume from a ConfigMap but not actually watch the content of the ConfigMap (I've seen some cases where they expect something like SIGHUP as the way to reread from the filesystem).

Having a stable hash for the container is also a possibility. The Juju config itself isn't changing, so presumably it is other content that is ending up in the config hash. Is it IP addresses or unit names, or something else that we should be avoiding because it isn't stable?

Looking here:
https://github.com/juju-solutions/bundle-kubeflow/blob/master/charms/dex-auth/reactive/dex_auth.py#L77

a) 'static-config' uses uuid4(), which means that every time it is evaluated it changes (so not very static). If you need a UUID that is generated once but does not change on every hook invocation, you should put it into either leader data or a peer relation for the application (see the sketch at the end of this comment).
b) I don't know what is in oidc_client_info, but that looks like data that came from a relation, so it shouldn't be changing on a new unit.

So it seems that your UUID is dynamically generated on each pass through this function, which gives it a new config hash every time. If you change how the UUID is generated, the config should be stable and you should still be able to append the config hash.
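
For example, a sketch of the leader-data approach (assuming a reactive charm with charmhelpers available; the key name "static-uuid" is made up):

    from uuid import uuid4
    from charmhelpers.core.hookenv import is_leader, leader_get, leader_set

    def get_static_uuid():
        # Generate the UUID once on the leader and reuse it on every
        # subsequent hook invocation, instead of calling uuid4() each time.
        value = leader_get("static-uuid")
        if value is None and is_leader():
            value = str(uuid4())
            leader_set({"static-uuid": value})
        return value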

Kenneth Koski (knkski) wrote:

Got this issue correctly diagnosed as a problem with the Docker layer; see https://github.com/juju-solutions/layer-docker-resource/issues/6 for more details.

Changed in juju:
status: Triaged → Invalid
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.8.3 → none
Kenneth Koski (knkski) wrote:

This bug is actually either back, or I screwed up testing of the fix earlier. I'm able to reproduce this on 2.8/stable.

After reviewing the dex-auth code: the charm uses uuid4() to force a change in the pod spec, which in turn forces dex to reload its configuration with the updated version from disk. See here for the uuid4() call:

https://github.com/juju-solutions/bundle-kubeflow/blob/master/charms/dex-auth/reactive/dex_auth.py#L84

Without the call to uuid4(), Dex never gets updated configuration from the relation to oidc-gatekeeper.
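
One possible way to keep the restart-on-config-change behaviour without the flapping (a sketch only, not a patch to the charm) is to derive the trigger from the rendered configuration content rather than from a random value, so the pod spec only changes when the configuration actually changes:

    import hashlib

    def config_hash(rendered_config: str) -> str:
        # Same configuration bytes always produce the same hash, so the pod
        # spec (and hence the Deployment) only changes when the config does.
        return hashlib.sha256(rendered_config.encode()).hexdigest()[:16]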

Thomas Miller (tlmiller)
Changed in juju:
assignee: nobody → Thomas Miller (tlmiller)