Silent hook failure after MicroK8s restart

Bug #1900930 reported by Kenneth Koski
This bug affects 1 person
Affects: juju
Status: Fix Released
Importance: High
Assigned to: Ian Booth
Milestone: 2.8.7

Bug Description

I'm trying to add a check in CI that ensures that tests pass both before and after a restart of the MicroK8s instance that Charmed Kubeflow is deployed on. I encountered a failure in one job while 11 other jobs worked, so it seems that this is a race condition. Additionally, I did not see any useful logging in the controller log.

Here is the failing CI run:

https://github.com/juju-solutions/bundle-kubeflow/runs/1288995933

I've attached the controller logs; here are the only relevant error lines I can find:

application-pytorch-operator: 20:53:10 DEBUG juju.kubernetes.provider.exec exec on pod "pytorch-operator-6974c76984-m47qt" for cmd ["sh" "-c" "cd /var/lib/juju; mkdir -p /tmp; echo $$ > /tmp/abhodxdu.pid; exec sh -c '/var/lib/juju/tools/jujud caas-unit-init --unit unit-pytorch-operator-0 --charm-dir /tmp/unit-pytorch-operator-0900121166/charm --send --operator-file /tmp/unit-pytorch-operator-0900121166/operator-client-cache.yaml --operator-ca-cert-file /tmp/unit-pytorch-operator-0900121166/ca.crt'; "]
application-pytorch-operator: 20:53:12 DEBUG juju.worker.uniter.operation committing operation "remote init" for pytorch-operator/0
application-pytorch-operator: 20:53:13 INFO juju.worker.uniter awaiting error resolution for "update-status" hook
application-pytorch-operator: 20:53:13 DEBUG juju.worker.uniter [AGENT-STATUS] error: hook failed: "update-status"

It's possible that the sh command is failing, but there's no indication that this is actually the case.
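
For illustration only (this is not the code Juju actually uses), here is a minimal client-go sketch of running such a command in the operator pod while capturing stderr and the exec error, so a failing "sh -c ..." would at least leave something in the logs:

package main

import (
	"bytes"
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// execInPod runs cmd in the named pod and returns an error that includes
// stderr, so a non-zero exit from the shell is visible to the caller.
func execInPod(cfg *rest.Config, client kubernetes.Interface, namespace, pod string, cmd []string) (string, error) {
	req := client.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace(namespace).
		Name(pod).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Command: cmd,
			Stdout:  true,
			Stderr:  true,
		}, scheme.ParameterCodec)

	executor, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
	if err != nil {
		return "", err
	}
	var stdout, stderr bytes.Buffer
	err = executor.StreamWithContext(context.Background(), remotecommand.StreamOptions{
		Stdout: &stdout,
		Stderr: &stderr,
	})
	if err != nil {
		// This is the detail that is missing from the controller log above.
		return stdout.String(), fmt.Errorf("exec failed: %v; stderr: %s", err, stderr.String())
	}
	return stdout.String(), nil
}

Something along these lines would show the actual error from the command, rather than only "hook failed: update-status".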

Tags: k8s
Kenneth Koski (knkski) wrote :
Ian Booth (wallyworld) wrote :
Ian Booth (wallyworld)
tags: added: k8s
Changed in juju:
milestone: none → 2.8.7
status: New → Triaged
importance: Undecided → High
Ian Booth (wallyworld) wrote :

Looking at the log files in the artifacts from the failed run, there are a number of operator agents reporting that they cannot connect to the controller:

/workspace/_build/src/github.com/juju/juju/worker/apicaller/manifold.go:97: [6a8c97] "application-minio" cannot open api
application-seldon-core: 20:43:24 DEBUG juju.worker.apicaller connecting with old password
application-pipelines-scheduledworkflow: 20:43:25 DEBUG juju.worker.apicaller [6a8c97] failed to connect
application-pipelines-scheduledworkflow: 20:43:25 DEBUG juju.worker.dependency "api-caller" manifold worker stopped: [6a8c97] "application-pipelines-scheduledworkflow" cannot open api: unable to connect to API: dial tcp 10.152.183.72:17070: i/o timeout
application-pipelines-scheduledworkflow: 20:43:25 DEBUG juju.worker.apicaller connecting with old password
application-argo-controller: 20:43:25 DEBUG juju.worker.apicaller connecting with old password
application-pipelines-db: 20:43:26 DEBUG juju.worker.apicaller connecting with old password
application-pipelines-persistence: 20:43:26 DEBUG juju.worker.apicaller connecting with old password
application-tf-job-operator: 20:43:27 DEBUG juju.worker.apicaller [6a8c97] failed to connect
application-tf-job-operator: 20:43:27 DEBUG juju.worker.dependency "api-caller" manifold worker stopped: [6a8c97] "application-tf-job-operator" cannot open api: unable to connect to API: dial tcp 10.152.183.72:17070: i/o timeout
application-tf-job-operator: 20:43:27 ERROR juju.worker.dependency "api-caller" manifold worker returned unexpected error: [6a8c97] "application-tf-job-operator" cannot open api: unable to connect to API: dial tcp 10.152.183.72:17070: i/o timeout
application-tf-job-operator: 20:43:27 DEBUG juju.worker.dependency stack trace:
dial tcp 10.152.183.72:17070: i/o timeout

I guess this is because after the microk8s restart the cluster got a whole new IP address, as did the controller service. Juju doesn't expect a single controller to change its IP address, and there's no way to notify the distributed agents that this has happened. HA is an answer because it allows the agents to keep talking to other, unaffected controllers and thus be informed that controller X has a new IP address. But even HA (if it were supported) won't work if every controller gets a new IP address at the same time.

Perhaps we need to look at using cluster DNS and the service FQDN, e.g.

controller-service.controller-<name>.svc.cluster.local
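
For example (a sketch, not Juju code), an agent that resolves the service FQDN via the cluster DNS always gets the service's current ClusterIP, even after a restart; the "controller-micro" namespace below is just a placeholder:

package main

import (
	"fmt"
	"net"
)

func main() {
	// Placeholder namespace; the real one is controller-<controller name>.
	fqdn := "controller-service.controller-micro.svc.cluster.local"

	// Inside the cluster this resolves via kube-dns/CoreDNS to whatever
	// ClusterIP the controller service currently has.
	addrs, err := net.LookupHost(fqdn)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Printf("%s -> %v\n", fqdn, addrs)
}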

Ian Booth (wallyworld) wrote :

https://github.com/juju/juju/pull/12297

This PR writes the controller's hostname, based on the cluster DNS, to the agent configuration file, giving agents a non-IP option for connecting to the controller.
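
As a rough illustration (assumed behaviour, not the actual code from the PR), an agent that has both the cached ClusterIP and the cluster-DNS hostname in its list of API endpoints can fall back to the hostname when the cached address stops answering:

package main

import (
	"fmt"
	"net"
	"time"
)

// dialFirst tries each endpoint in turn and returns the first connection
// that succeeds.
func dialFirst(endpoints []string) (net.Conn, error) {
	var lastErr error
	for _, ep := range endpoints {
		conn, err := net.DialTimeout("tcp", ep, 5*time.Second)
		if err == nil {
			return conn, nil
		}
		lastErr = err
	}
	return nil, lastErr
}

func main() {
	endpoints := []string{
		// Stale ClusterIP cached from before the restart (from the log above).
		"10.152.183.72:17070",
		// Stable service FQDN written to agent.conf ("controller-micro" is a
		// placeholder namespace).
		"controller-service.controller-micro.svc.cluster.local:17070",
	}
	conn, err := dialFirst(endpoints)
	if err != nil {
		fmt.Println("all endpoints failed:", err)
		return
	}
	fmt.Println("connected to", conn.RemoteAddr())
	conn.Close()
}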

Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
