Looking at the log files in the artifacts recorded on the failed run, there are a bunch of operator agents which report they cannot connect to the controller:
/workspace/_build/src/github.com/juju/juju/worker/apicaller/manifold.go:97: [6a8c97] "application-minio" cannot open api
application-seldon-core: 20:43:24 DEBUG juju.worker.apicaller connecting with old password
application-pipelines-scheduledworkflow: 20:43:25 DEBUG juju.worker.apicaller [6a8c97] failed to connect
application-pipelines-scheduledworkflow: 20:43:25 DEBUG juju.worker.dependency "api-caller" manifold worker stopped: [6a8c97] "application-pipelines-scheduledworkflow" cannot open api: unable to connect to API: dial tcp 10.152.183.72:17070: i/o timeout
application-pipelines-scheduledworkflow: 20:43:25 DEBUG juju.worker.apicaller connecting with old password
application-argo-controller: 20:43:25 DEBUG juju.worker.apicaller connecting with old password
application-pipelines-db: 20:43:26 DEBUG juju.worker.apicaller connecting with old password
application-pipelines-persistence: 20:43:26 DEBUG juju.worker.apicaller connecting with old password
application-tf-job-operator: 20:43:27 DEBUG juju.worker.apicaller [6a8c97] failed to connect
application-tf-job-operator: 20:43:27 DEBUG juju.worker.dependency "api-caller" manifold worker stopped: [6a8c97] "application-tf-job-operator" cannot open api: unable to connect to API: dial tcp 10.152.183.72:17070: i/o timeout
application-tf-job-operator: 20:43:27 ERROR juju.worker.dependency "api-caller" manifold worker returned unexpected error: [6a8c97] "application-tf-job-operator" cannot open api: unable to connect to API: dial tcp 10.152.183.72:17070: i/o timeout
application-tf-job-operator: 20:43:27 DEBUG juju.worker.dependency stack trace:
dial tcp 10.152.183.72:17070: i/o timeout
I guess this is because, after the microk8s restart, the cluster got a whole new IP address, as did the controller service. Juju doesn't expect a single controller to change its IP address; there's no way to notify the distributed agents that this has happened. HA is an answer, because it allows the agents to keep talking to the other, unaffected controllers and thus be told that controller X has a new IP address. But even HA (if it were supported) won't help if every controller gets a new IP address at the same time.
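To make the failure mode concrete, here is a minimal sketch (not Juju code, just an illustration) of what the agents are effectively doing: retrying a dial against the one controller address they have cached. If that single address stops existing, no amount of retrying recovers, because nothing ever rewrites the cached list. The address and port are taken from the log output above; the retry count is arbitrary.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// The only controller endpoint the agent knows about is the cached
	// ClusterIP from before the microk8s restart.
	cached := []string{"10.152.183.72:17070"}

	for attempt := 1; attempt <= 3; attempt++ {
		for _, addr := range cached {
			conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
			if err != nil {
				// This is the "dial tcp ...: i/o timeout" seen in the
				// operator logs; with a single controller, the loop can
				// never learn the replacement address.
				fmt.Printf("attempt %d: cannot open api: %v\n", attempt, err)
				continue
			}
			conn.Close()
			fmt.Println("connected to", addr)
			return
		}
	}
}
```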
Perhaps we need to look at using cluster DNS and the service FQDN, e.g.
controller-service.controller-<name>.svc.cluster.local
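For illustration, a sketch of what dialling the controller by its in-cluster service name could look like (the "my-controller" segment is a placeholder for the controller name, and this is not how the agent actually configures itself). The point is that the FQDN stays stable even when the Service is recreated with a new ClusterIP, so agents could reconnect after a restart without having to be told the new address:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	// Stable in-cluster DNS name for the controller service; only the
	// ClusterIP behind it changes across a microk8s restart.
	addr := "controller-service.controller-my-controller.svc.cluster.local:17070"

	dialer := &net.Dialer{Timeout: 10 * time.Second}
	conn, err := tls.DialWithDialer(dialer, "tcp", addr, &tls.Config{
		// Placeholder only: a real agent verifies the controller's CA cert.
		InsecureSkipVerify: true,
	})
	if err != nil {
		fmt.Println("cannot open api:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```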