Continuous rotation of K8s charm units

Bug #1895598 reported by Kenneth Koski
This bug affects 1 person
Affects         Status        Importance  Assigned to    Milestone
Canonical Juju  Fix Released  High        Thomas Miller
2.8             Fix Released  High        Thomas Miller

Bug Description

Similar overarching behavior to https://bugs.launchpad.net/juju/+bug/1892791, but adding the fix there (`clear_flag("layer.docker-resource.oci-image.available")`) didn't fix it in this case.

New units for argo-controller and pipelines-api are continuously rotated in as old ones die, as seen in this `juju status` output:

argo-controller/0* active idle 10.1.27.129
argo-controller/2 maintenance idle 10.1.27.106 fetching resource: oci-image
argo-controller/3 maintenance idle 10.1.27.106 fetching resource: oci-image
argo-controller/4 maintenance idle 10.1.27.106 fetching resource: oci-image
...
pipelines-api/0* terminated executing 10.1.27.134 8887/TCP,8888/TCP unit stopped by the cloud
pipelines-api/2 terminated executing 10.1.27.136 8887/TCP,8888/TCP unit stopped by the cloud
pipelines-api/3 maintenance executing 10.1.27.137 8887/TCP,8888/TCP fetching resource: oci-image

The issue seems to stem from an attempt to exec a command in the workload pod:

application-argo-controller: 00:20:58 ERROR juju.worker.caasoperator exited "argo-controller/1": executing operation "remote init": caas-unit-init for unit "argo-controller/1" with command: "/var/lib/juju/tools/jujud caas-unit-init --unit unit-argo-controller-1 --charm-dir /tmp/unit-argo-controller-1301453043/charm --upgrade" failed: sh: 1: cd: can't cd to /var/lib/juju
sh: 1: /var/lib/juju/tools/jujud: not found

Attached is full `juju debug` output.

Curiously, it doesn't seem to start until I actually attempt to run test pipelines. If I deploy it and leave it alone, the units don't get rotated in and out. Also strangely, pipelines-api doesn't have the same errors about execing that argo-controller does, but is still rotated in and out.

If it helps debug at all, both charms have relations to the minio charm. However, there's another charm that also has a relation to the minio charm that isn't displaying this behavior (pipelines-ui).

Tags: k8s
Revision history for this message
Kenneth Koski (knkski) wrote :

To reproduce:

- Boot up new VM
- snap install microk8s, juju, juju-helpers, juju-wait, and charm
- apt install python3-pytest, and make a symlink somewhere in $PATH of pytest -> pytest-3
- git clone https://github.com/juju-solutions/bundle-kubeflow.git
- cd bundle-kubeflow/
- git clone git://git.launchpad.net/canonical-osm
- cp -r canonical-osm/charms/interfaces/juju-relation-mysql mysql
- python3 scripts/cli.py microk8s setup --test-mode
- python3 scripts/cli.py deploy-to uk8s --build
- ./tests/run.sh -m full

Ian Booth (wallyworld)
tags: added: k8s
Changed in juju:
status: New → Triaged
importance: Undecided → High
Harry Pidcock (hpidcock)
Changed in juju:
assignee: nobody → Harry Pidcock (hpidcock)
Revision history for this message
Harry Pidcock (hpidcock) wrote :

I think this might be fixed by https://github.com/juju/juju/pull/11840

Can we retest with 2.8.2?

Revision history for this message
Harry Pidcock (hpidcock) wrote :

In the controller logs I found this

controller-0: 02:14:19 ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: resource name may not be empty
controller-0: 02:14:27 WARNING juju.kubernetes.provider Image parameter deprecated, use ImageDetails
controller-0: 02:14:29 WARNING juju.kubernetes.provider Image parameter deprecated, use ImageDetails
controller-0: 02:14:32 WARNING juju.kubernetes.provider Image parameter deprecated, use ImageDetails
controller-0: 02:15:00 WARNING juju.worker.httpserver http: TLS handshake error from 172.31.45.131:51480: read tcp 10.1.103.72:17070->172.31.45.131:51480: read: connection reset by peer

controller-0: 02:15:21 ERROR juju.apiserver error serving RPCs: codec.ReadHeader error: error receiving message: read tcp 10.1.103.72:17070->172.31.45.131:51720: read: connection reset by peer
controller-0: 02:16:51 ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: creating or updating service account: attempt count exceeded: resource is still being deleted
controller-0: 02:17:00 WARNING juju.kubernetes.provider Image parameter deprecated, use ImageDetails
controller-0: 02:17:13 WARNING juju.kubernetes.provider Image parameter deprecated, use ImageDetails
controller-0: 02:17:14 WARNING juju.kubernetes.provider Image parameter deprecated, use ImageDetails
controller-0: 02:17:18 ERROR juju.apiserver error serving RPCs: codec.ReadHeader error: error receiving message: read tcp 10.1.103.72:17070->172.31.45.131:52974: read: connection reset by peer
controller-0: 02:19:09 WARNING juju.worker.httpserver http: TLS handshake error from 172.31.45.131:54058: read tcp 10.1.103.72:17070->172.31.45.131:54058: read: connection reset by peer

controller-0: 02:19:39 ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: creating or updating service account: attempt count exceeded: resource is still being deleted
controller-0: 02:19:42 WARNING juju.worker.httpserver http: TLS handshake error from 172.31.45.131:54354: EOF

controller-0: 02:19:46 WARNING juju.kubernetes.provider Image parameter deprecated, use ImageDetails
controller-0: 02:19:47 WARNING juju.kubernetes.provider Image parameter deprecated, use ImageDetails
controller-0: 02:20:06 WARNING juju.kubernetes.provider Image parameter deprecated, use ImageDetails
controller-0: 02:20:07 WARNING juju.worker.httpserver http: TLS handshake error from 172.31.45.131:54576: EOF

controller-0: 02:22:04 ERROR juju.worker.dependency "caas-unit-provisioner" manifold worker returned unexpected error: creating or updating service account: attempt count exceeded: resource is still being deleted
controller-0: 02:22:13 WARNING juju.kubernetes.provider Image parameter deprecated, use ImageDetails
controller-0: 02:22:23 WARNING juju.kubernetes.provider Image parameter deprecated, use ImageDetails
controller-0: 02:22:24 WARNING juju.kubernetes.provider Image parameter deprecated, use ImageDetails
controller-0: 02:47:04 ERROR juju.core.raftlease command Command(ver: 1, op: claim, ns: application-leadership, model: a4b7b9, lease: argo-controller, holder: argo-controller/2): i...


John A Meinel (jameinel)
Changed in juju:
assignee: Harry Pidcock (hpidcock) → Thomas Miller (tlmiller)
Revision history for this message
Kenneth Koski (knkski) wrote :

Git bisect led to this commit being the cause of this bug:

https://github.com/juju/juju/commit/6c235e4dbbd30812f10a71a769a03e56d77bfe8e

Revision history for this message
Kenneth Koski (knkski) wrote :

Still seeing this issue with 2.8/edge. It's causing intermittent CI failures, because the `pipelines-api` charm will get bounced in the middle of tests, breaking them. Other charms are also randomly getting bounced, causing similar failures.

See here for an example of pipelines-api bouncing causing a CI failure:

https://github.com/juju-solutions/bundle-kubeflow/runs/1167502608?check_suite_focus=true

Revision history for this message
Kenneth Koski (knkski) wrote :

I was able to reproduce this after a number of attempts. I'm attaching the logs for the pipelines-api operator.

Revision history for this message
Kenneth Koski (knkski) wrote :

I believe I've tracked this down to Juju's ordering of items in the ConfigMap that it mounts into the workload pod as files. As an example, here are snippets from two different calls that Juju makes periodically to keep the Deployment in sync with what's been defined in the pod spec:

Snippet #1:
"configMap": {
    "name": "pipelines-api-samples-config",
    "items": [
        {
            "key": "parallel_join.yaml",
            "path": "parallel_join.yaml"
        }, {
            "key": "sequential.yaml",
            "path": "sequential.yaml"
        }, {
            "key": "xgboost_training_cm.yaml",
            "path": "xgboost_training_cm.yaml"
        }, {
            "key": "condition.yaml",
            "path": "condition.yaml"
        }, {
            "key": "exit_handler.yaml",
            "path": "exit_handler.yaml"
        }
    ],
    "defaultMode": 420
}

Snippet #2:
"configMap": {
    "name": "pipelines-api-samples-config",
    "items": [
        {
            "key": "sequential.yaml",
            "path": "sequential.yaml"
        }, {
            "key": "xgboost_training_cm.yaml",
            "path": "xgboost_training_cm.yaml"
        }, {
            "key": "condition.yaml",
            "path": "condition.yaml"
        }, {
            "key": "exit_handler.yaml",
            "path": "exit_handler.yaml"
        }, {
            "key": "parallel_join.yaml",
            "path": "parallel_join.yaml"
        }
    ],
    "defaultMode": 420
}

Notice how `parallel_join.yaml` has changed positions, which makes Kubernetes think the Deployment needs updating.
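
To make the comparison concrete, here is a minimal Python sketch of why the two snippets look different when compared as ordered lists but identical once the items are sorted by key. The key names are copied from the snippets above; the helper names `to_items` and `normalized` are made up for illustration and are not part of Juju or Kubernetes.

```python
# Key order as it appears in snippet #1 and snippet #2 above.
# In both snippets every item's "path" is the same as its "key".
snippet_1_keys = [
    "parallel_join.yaml",
    "sequential.yaml",
    "xgboost_training_cm.yaml",
    "condition.yaml",
    "exit_handler.yaml",
]
snippet_2_keys = [
    "sequential.yaml",
    "xgboost_training_cm.yaml",
    "condition.yaml",
    "exit_handler.yaml",
    "parallel_join.yaml",
]


def to_items(keys):
    """Hypothetical helper: build the "items" list in the given order."""
    return [{"key": k, "path": k} for k in keys]


def normalized(items):
    """Hypothetical helper: sort items by key so order no longer matters."""
    return sorted(items, key=lambda item: item["key"])


items_1 = to_items(snippet_1_keys)
items_2 = to_items(snippet_2_keys)

# An order-sensitive comparison sees two different specs.
naive_equal = items_1 == items_2

# After sorting by key, the two specs are the same.
normalized_equal = normalized(items_1) == normalized(items_2)

print(naive_equal, normalized_equal)
```

The same five files are present in both lists, so only the ordering differs between the two calls.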

Revision history for this message
Kenneth Koski (knkski) wrote :

The full event for snippet #1

Revision history for this message
Kenneth Koski (knkski) wrote :

The full event for snippet #2

Revision history for this message
Kenneth Koski (knkski) wrote :

Additionally, I don't think I can solve this properly from within the charm. Here's where we generate the relevant pod spec:

https://github.com/juju-solutions/bundle-kubeflow/blob/f233435c/charms/pipelines-api/reactive/pipelines_api.py#L176-L183

That code doesn't impose any ordering. Even if it did, we're generating a dictionary that gets turned into YAML, which Juju then deserializes, so we're reliant on Go's deserialization behavior. Go doesn't guarantee any ordering of map keys according to https://github.com/golang/go/issues/27179, so any ordering we specify in the pod spec YAML won't necessarily be reflected in what Juju/Go parses from that YAML.
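
As an illustration only, the sketch below shows one way a consumer of such a mapping could produce a stable item list: sort the keys before emitting the items. This is an assumption about the general technique, not a description of the change that was actually made in Juju. The names `build_items` and `shuffled_mapping` are hypothetical, and the shuffle merely simulates a map whose iteration order varies between calls.

```python
import random


def build_items(files):
    """Hypothetical helper: emit volume items in sorted key order."""
    return [{"key": name, "path": name} for name in sorted(files)]


def shuffled_mapping(files, seed):
    """Simulate a map whose iteration order is not stable between calls."""
    names = list(files)
    random.Random(seed).shuffle(names)
    return {name: files[name] for name in names}


# File names taken from the ConfigMap in this report; contents are placeholders.
files = {
    "parallel_join.yaml": "<contents>",
    "sequential.yaml": "<contents>",
    "xgboost_training_cm.yaml": "<contents>",
    "condition.yaml": "<contents>",
    "exit_handler.yaml": "<contents>",
}

# Build the item list from ten differently ordered copies of the same mapping.
results = [build_items(shuffled_mapping(files, seed)) for seed in range(10)]

# Sorting makes every result identical, whatever the input order was.
all_same = all(result == results[0] for result in results)
print(all_same)
```

Because the sort happens on the consuming side, the output does not depend on the order in which the charm wrote the keys.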

Revision history for this message
Thomas Miller (tlmiller) wrote :
Changed in juju:
status: Triaged → In Progress
Thomas Miller (tlmiller)
Changed in juju:
status: In Progress → Fix Committed
John A Meinel (jameinel)
Changed in juju:
milestone: none → 2.9-beta1
Changed in juju:
status: Fix Committed → Fix Released