Looking at the log files in the artifacts recorded on the failed run, there are a bunch of operator agents which report they cannot connect to the controller:
/workspace/_build/src/github.com/juju/juju/worker/apicaller/manifold.go:97: [6a8c97] "application-minio" cannot open api
application-seldon-core: 20:43:24 DEBUG juju.worker.apicaller connecting with old password
application-pipelines-scheduledworkflow: 20:43:25 DEBUG juju.worker.apicaller [6a8c97] failed to connect
application-pipelines-scheduledworkflow: 20:43:25 DEBUG juju.worker.dependency "api-caller" manifold worker stopped: [6a8c97] "application-pipelines-scheduledworkflow" cannot open api: unable to connect to API: dial tcp 10.152.183.72:17070: i/o timeout
application-pipelines-scheduledworkflow: 20:43:25 DEBUG juju.worker.apicaller connecting with old password
application-argo-controller: 20:43:25 DEBUG juju.worker.apicaller connecting with old password
application-pipelines-db: 20:43:26 DEBUG juju.worker.apicaller connecting with old password
application-pipelines-persistence: 20:43:26 DEBUG juju.worker.apicaller connecting with old password
application-tf-job-operator: 20:43:27 DEBUG juju.worker.apicaller [6a8c97] failed to connect
application-tf-job-operator: 20:43:27 DEBUG juju.worker.dependency "api-caller" manifold worker stopped: [6a8c97] "application-tf-job-operator" cannot open api: unable to connect to API: dial tcp 10.152.183.72:17070: i/o timeout
application-tf-job-operator: 20:43:27 ERROR juju.worker.dependency "api-caller" manifold worker returned unexpected error: [6a8c97] "application-tf-job-operator" cannot open api: unable to connect to API: dial tcp 10.152.183.72:17070: i/o timeout
application-tf-job-operator: 20:43:27 DEBUG juju.worker.dependency stack trace:
dial tcp 10.152.183.72:17070: i/o timeout
I guess this is because, after the microk8s restart, the cluster got a whole new IP address, as did the controller service. Juju doesn't expect a single controller to change its IP address; there's no way to notify the distributed agents that this has happened. HA is an answer, because it allows the agents to keep talking to the other, unaffected controllers and thus be told that controller X has a new IP address. But even HA (if it were supported) won't help if every controller gets a new IP address at the same time.
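To make the failure mode concrete, here is a minimal sketch (not Juju code, just an illustration) of what the agents are effectively doing: retrying a dial against the one controller address they have cached. If that single address stops existing, no amount of retrying recovers, because nothing ever rewrites the cached list. The address and port are taken from the log output above; the retry count is arbitrary.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// The only controller endpoint the agent knows about is the cached
	// ClusterIP from before the microk8s restart.
	cached := []string{"10.152.183.72:17070"}

	for attempt := 1; attempt <= 3; attempt++ {
		for _, addr := range cached {
			conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
			if err != nil {
				// This is the "dial tcp ...: i/o timeout" seen in the
				// operator logs; with a single controller, the loop can
				// never learn the replacement address.
				fmt.Printf("attempt %d: cannot open api: %v\n", attempt, err)
				continue
			}
			conn.Close()
			fmt.Println("connected to", addr)
			return
		}
	}
}
```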
Perhaps we need to look at using cluster DNS and the service FQDN, e.g.
controller-service.controller-<name>.svc.cluster.local
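For illustration, a sketch of what dialling the controller by its in-cluster service name could look like (the "my-controller" segment is a placeholder for the controller name, and this is not how the agent actually configures itself). The point is that the FQDN stays stable even when the Service is recreated with a new ClusterIP, so agents could reconnect after a restart without having to be told the new address:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func main() {
	// Stable in-cluster DNS name for the controller service; only the
	// ClusterIP behind it changes across a microk8s restart.
	addr := "controller-service.controller-my-controller.svc.cluster.local:17070"

	dialer := &net.Dialer{Timeout: 10 * time.Second}
	conn, err := tls.DialWithDialer(dialer, "tcp", addr, &tls.Config{
		// Placeholder only: a real agent verifies the controller's CA cert.
		InsecureSkipVerify: true,
	})
	if err != nil {
		fmt.Println("cannot open api:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```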