Bug #2015371 “[juju 3.2-beta1] Connection to controller fails: E...” : Bugs : Canonical Juju

Revision history for this message

Joseph Phillips (manadart) wrote on 2023-04-05 (last edit on 2023-04-05):

#1

The process of establishing HA has changed.

What you're seeing will be due to the initial controller bouncing for Dqlite node rebinding while the other controllers are also in the process of joining the cluster - there is a brief interval where the API will be inaccessible.

Joseph Phillips (manadart) on 2023-04-06

Changed in juju:
status:	New → Triaged
importance:	Undecided → High
assignee:	nobody → Joseph Phillips (manadart)

Joseph Phillips (manadart) on 2023-04-12

Changed in juju:
status:	Triaged → In Progress

Revision history for this message

Joseph Phillips (manadart) wrote on 2023-04-12:

#2

This might mitigate the downtime, but there's no way to avoid some churn during the establishment of HA.

https://github.com/juju/juju/pull/15468

It is likely that we'll have to adjust the test case expectations.

Joseph Phillips (manadart) on 2023-04-12

Changed in juju:
status:	In Progress → Fix Committed

Joseph Phillips (manadart) on 2023-04-12

Changed in juju:
milestone:	none → 3.2-beta2

Revision history for this message

John A Meinel (jameinel) wrote on 2023-04-12:

#3

Note that Solutions QA is seeing failures of this sort during our testing runs (in their testing of 3.2-beta1 they said that they had an error of this sort 3x during their test suite).

So we need to figure out how to preserve "CI doesn't fail during 'juju status' after enable-ha" in some fashion. Otherwise getting clean test runs for Openstack/CK/Kubeflow etc are all going to run into this, even if our own CI job is fixed.

If we have a clean pattern of how things should be tested, we can work with Solutions QA on that, but we need to have a clear "this is how we go into HA and how you test against it"

Revision history for this message

Bas de Bruijne (basdbruijne) wrote on 2023-04-12:

#4

We can update our CI to make the controller not being reachable during enable-ha expected behaviour. What is the maximum amount of time that you expect the controller to be unreachable? We currently check the status every 30 seconds, with the number of occurrences we have seen of this bug I expect it to be not much longer than that.

Revision history for this message

Joseph Phillips (manadart) wrote on 2023-05-22:

#5

We discovered another avenue for improving performance here.

https://github.com/juju/juju/pull/15619

This should take some 30 seconds off the time it takes to establish HA.

Joseph Phillips (manadart) on 2023-05-25

Changed in juju:
milestone:	3.2-beta2 → 3.2.1

Revision history for this message

Jeffrey Chang (modern911) wrote on 2023-06-21:

#6

Download full text (3.2 KiB)

I found similar error in a different run on different stage, https://solutions.qa.canonical.com/testruns/91506eb0-8447-41fa-90fe-c7a5898d64f2.
Running on AWS, and one of the EC2 instance(machine 14) lost connection to juju controller. But we are still able to collect logs on that instance.

2023-06-20 20:28:43 DEBUG jujuc server.go:213 running hook tool "status-set" for vault/0-etcd-relation-created-4597090498041969428
2023-06-20 20:31:14 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:32:24 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:31:14 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:31:00 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:31:19 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:35:44 DEBUG juju.worker.leadership tracker.go:227 filebeat/8 waiting for filebeat leadership release gave err: error blocking on leadership release: waiting for leadership cancelled by client
2023-06-20 20:35:44 DEBUG juju.rpc server.go:328 error closing codec: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 172.31.33.62:43424->172.31.33.161:17070: write: broken pipe
2023-06-20 20:31:09 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:35:44 DEBUG juju.rpc server.go:328 error closing codec: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 172.31.33.62:43362->172.31.33.161:17070: write: broken pipe
2023-06-20 20:35:44 DEBUG juju.worker.dependency engine.go:618 "migration-inactive-flag" manifold worker stopped: watcher has been stopped (stopped)
stack trace:
watcher has been stopped (stopped)
github.com/juju/juju/rpc.(*Conn).Call:178:
github.com/juju/juju/api.(*state).APICall:1251:
2023-06-20 20:35:44 ERROR juju.worker.dependency engine.go:695 "migration-inactive-flag" manifold worker returned unexpected error: watcher has been stopped (stopped)
2023-06-20 20:35:44 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:35:44 DEBUG juju.rpc server.go:328 error closing codec: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 172.31.33.62:41852->54.224.157.225:17070: write: broken pipe
2023-06-20 20:35:44 ERROR juju.worker.uniter.metrics listener.go:52 failed to close the collect-metrics listener: close unix /var/lib/juju/agents/unit-telegraf-8/4012084279/s: use of closed network connection
2023-06-20 20:35:44 ERROR juju.worker.uniter.metrics listener.go:52 failed to close the collect-metrics listener: close unix /var/lib/juju/agents/unit-ubuntu-advantage-8/911094120/s: use of closed network connection
2023-06-20 20:35:44 ERROR juju.api.watcher watcher.go:95 error trying to stop watcher: write tcp 172.31.33.62:43384->172.31.33.161:17070: write: broken pipe
2023-06-20 20:35:44 DEBUG juju.rpc server.go:328 error closing codec: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 172.31.33.62:33654->54.224.157.225:17070: write: broken pipe
2023-06-20 20:35:44 DEBUG juju.worker.dependency engine.go:618 "valid-credential-flag" manifold worker stop...

I found similar error in a different run on different stage, https://solutions.qa.canonical.com/testruns/91506eb0-8447-41fa-90fe-c7a5898d64f2.
Running on AWS, and one of the EC2 instance(machine 14) lost connection to juju controller. But we are still able to collect logs on that instance.

2023-06-20 20:28:43 DEBUG jujuc server.go:213 running hook tool "status-set" for vault/0-etcd-relation-created-4597090498041969428
2023-06-20 20:31:14 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:32:24 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:31:14 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:31:00 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:31:19 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:35:44 DEBUG juju.worker.leadership tracker.go:227 filebeat/8 waiting for filebeat leadership release gave err: error blocking on leadership release: waiting for leadership cancelled by client
2023-06-20 20:35:44 DEBUG juju.rpc server.go:328 error closing codec: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 172.31.33.62:43424->172.31.33.161:17070: write: broken pipe
2023-06-20 20:31:09 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:35:44 DEBUG juju.rpc server.go:328 error closing codec: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 172.31.33.62:43362->172.31.33.161:17070: write: broken pipe
2023-06-20 20:35:44 DEBUG juju.worker.dependency engine.go:618 "migration-inactive-flag" manifold worker stopped: watcher has been stopped (stopped)
stack trace:
watcher has been stopped (stopped)
github.com/juju/juju/rpc.(*Conn).Call:178: 
github.com/juju/juju/api.(*state).APICall:1251: 
2023-06-20 20:35:44 ERROR juju.worker.dependency engine.go:695 "migration-inactive-flag" manifold worker returned unexpected error: watcher has been stopped (stopped)
2023-06-20 20:35:44 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:35:44 DEBUG juju.rpc server.go:328 error closing codec: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 172.31.33.62:41852->54.224.157.225:17070: write: broken pipe
2023-06-20 20:35:44 ERROR juju.worker.uniter.metrics listener.go:52 failed to close the collect-metrics listener: close unix /var/lib/juju/agents/unit-telegraf-8/4012084279/s: use of closed network connection
2023-06-20 20:35:44 ERROR juju.worker.uniter.metrics listener.go:52 failed to close the collect-metrics listener: close unix /var/lib/juju/agents/unit-ubuntu-advantage-8/911094120/s: use of closed network connection
2023-06-20 20:35:44 ERROR juju.api.watcher watcher.go:95 error trying to stop watcher: write tcp 172.31.33.62:43384->172.31.33.161:17070: write: broken pipe
2023-06-20 20:35:44 DEBUG juju.rpc server.go:328 error closing codec: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 172.31.33.62:33654->54.224.157.225:17070: write: broken pipe
2023-06-20 20:35:44 DEBUG juju.worker.dependency engine.go:618 "valid-credential-flag" manifold worker stopped: watcher has been stopped (stopped)

John A Meinel (jameinel) on 2023-07-12

Changed in juju:
milestone:	3.2.1 → 3.2.2

Revision history for this message

Bas de Bruijne (basdbruijne) wrote on 2023-08-08:

#7

I am still seeing this in the 3.2.2 release testing (test run https://solutions.qa.canonical.com/testruns/29919bf8-9c75-4501-a42c-8d801387e34c/):

===========
2023-08-08-08:35:22 root DEBUG [localhost]: juju status -m foundations-maas:controller --format=yaml
2023-08-08-08:35:27 root ERROR [localhost] Command failed: juju status -m foundations-maas:controller --format=yaml
2023-08-08-08:35:27 root ERROR 1[localhost] STDOUT follows:
{}

2023-08-08-08:35:27 root ERROR 2[localhost] STDERR follows:
ERROR connection is shut down
===========

I can investigate later why this is happening. Are there any additional logs that we can provide?

Revision history for this message

Joseph Phillips (manadart) wrote on 2023-08-21:

#8

We are working with the Dqlite guys to work out why establishment of a cluster is taking longer than expected.

In the interim, this may help a little:
https://github.com/juju/juju/pull/16129

Canonical Juju QA Bot (juju-qa-bot) on 2023-08-21

Changed in juju:
status:	Fix Committed → Fix Released

Revision history for this message

Jeffrey Chang (modern911) wrote on 2023-09-07:

#9

Download full text (4.3 KiB)

We saw this in juju 3.2.3 candidate, (https://solutions.qa.canonical.com/testruns/a09bb7aa-342c-4470-b77e-670554f134c5)
It happened during HA setup.

2023-09-06-18:25:14 root DEBUG [localhost]: juju status -m foundations-maas:controller --format=yaml
2023-09-06-18:25:14 fce.juju_maas_controller DEBUG controller status: {'model': {'name': 'controller', 'type': 'iaas', 'controller': 'foundations-maas', 'cloud': 'maas_cloud', 'region': 'default', 'version': '3.2.3', 'model-status': {'current': 'available', 'since': '06 Sep 2023 18:19:00Z'}, 'sla': 'unsupported'}, 'machines': {'0': {'juju-status': {'current': 'started', 'since': '06 Sep 2023 18:19:06Z', 'version': '3.2.3'}, 'hostname': 'juju-28-1-6', 'dns-name': '10.246.165.84', 'ip-addresses': ['10.246.165.84'], 'instance-id': '8payah', 'display-name': 'juju-28-1-6', 'machine-status': {'current': 'running', 'message': 'Deployed', 'since': '06 Sep 2023 18:19:32Z'}, 'modification-status': {'current': 'idle', 'since': '06 Sep 2023 18:19:01Z'}, 'base': {'name': 'ubuntu', 'channel': '22.04'}, 'network-interfaces': {'ens4': {'ip-addresses': ['10.246.165.84'], 'mac-address': '52:54:00:3d:11:28', 'gateway': '10.246.164.1', 'space': 'oam-space', 'is-up': True}}, 'constraints': 'arch=amd64 mem=3584M tags=juju,silo3', 'hardware': 'arch=amd64 cores=4 mem=24576M tags=virtual,pod-console-logging,rack1,juju,silo3,juju-28-1 availability-zone=zone1', 'controller-member-status': 'has-vote', 'ha-primary': True}, '1': {'juju-status': {'current': 'pending', 'since': '06 Sep 2023 18:19:08Z'}, 'dns-name': '10.246.164.139', 'ip-addresses': ['10.246.164.139'], 'instance-id': 'b3wega', 'display-name': 'juju-30-1-8', 'machine-status': {'current': 'running', 'message': 'Deployed', 'since': '06 Sep 2023 18:23:44Z'}, 'modification-status': {'current': 'idle', 'since': '06 Sep 2023 18:19:08Z'}, 'base': {'name': 'ubuntu', 'channel': '22.04'}, 'constraints': 'arch=amd64 mem=3584M tags=juju,silo3', 'hardware': 'arch=amd64 cores=4 mem=24576M tags=virtual,pod-console-logging,rack1,juju,silo3,juju-30-1 availability-zone=zone3', 'controller-member-status': 'adding-vote'}, '2': {'juju-status': {'current': 'pending', 'since': '06 Sep 2023 18:19:08Z'}, 'dns-name': '10.246.166.202', 'ip-addresses': ['10.246.166.202'], 'instance-id': 'tt4c86', 'display-name': 'juju-29-1-7', 'machine-status': {'current': 'running', 'message': 'Deployed', 'since': '06 Sep 2023 18:23:44Z'}, 'modification-status': {'current': 'idle', 'since': '06 Sep 2023 18:19:08Z'}, 'base': {'name': 'ubuntu', 'channel': '22.04'}, 'constraints': 'arch=amd64 mem=3584M tags=juju,silo3', 'hardware': 'arch=amd64 cores=4 mem=24576M tags=virtual,pod-console-logging,rack1,juju,silo3,juju-29-1 availability-zone=zone2', 'controller-member-status': 'adding-vote'}}, 'applications': {'controller': {'charm': 'juju-controller', 'base': {'name': 'ubuntu', 'channel': '22.04'}, 'charm-origin': 'charmhub', 'charm-name': 'juju-controller', 'charm-rev': 14, 'charm-channel': '3.2/stable', 'exposed': False, 'application-status': {'current': 'waiting', 'message': 'waiting for machine', 'since': '06 Sep 2023 18:19:08Z'}, 'units': {'controller/0': {'workload-status': {'current': 'active', 'since': ...

We saw this in juju 3.2.3 candidate, (https://solutions.qa.canonical.com/testruns/a09bb7aa-342c-4470-b77e-670554f134c5)
It happened during HA setup.

2023-09-06-18:25:14 root DEBUG [localhost]: juju status -m foundations-maas:controller --format=yaml
2023-09-06-18:25:14 fce.juju_maas_controller DEBUG controller status: {'model': {'name': 'controller', 'type': 'iaas', 'controller': 'foundations-maas', 'cloud': 'maas_cloud', 'region': 'default', 'version': '3.2.3', 'model-status': {'current': 'available', 'since': '06 Sep 2023 18:19:00Z'}, 'sla': 'unsupported'}, 'machines': {'0': {'juju-status': {'current': 'started', 'since': '06 Sep 2023 18:19:06Z', 'version': '3.2.3'}, 'hostname': 'juju-28-1-6', 'dns-name': '10.246.165.84', 'ip-addresses': ['10.246.165.84'], 'instance-id': '8payah', 'display-name': 'juju-28-1-6', 'machine-status': {'current': 'running', 'message': 'Deployed', 'since': '06 Sep 2023 18:19:32Z'}, 'modification-status': {'current': 'idle', 'since': '06 Sep 2023 18:19:01Z'}, 'base': {'name': 'ubuntu', 'channel': '22.04'}, 'network-interfaces': {'ens4': {'ip-addresses': ['10.246.165.84'], 'mac-address': '52:54:00:3d:11:28', 'gateway': '10.246.164.1', 'space': 'oam-space', 'is-up': True}}, 'constraints': 'arch=amd64 mem=3584M tags=juju,silo3', 'hardware': 'arch=amd64 cores=4 mem=24576M tags=virtual,pod-console-logging,rack1,juju,silo3,juju-28-1 availability-zone=zone1', 'controller-member-status': 'has-vote', 'ha-primary': True}, '1': {'juju-status': {'current': 'pending', 'since': '06 Sep 2023 18:19:08Z'}, 'dns-name': '10.246.164.139', 'ip-addresses': ['10.246.164.139'], 'instance-id': 'b3wega', 'display-name': 'juju-30-1-8', 'machine-status': {'current': 'running', 'message': 'Deployed', 'since': '06 Sep 2023 18:23:44Z'}, 'modification-status': {'current': 'idle', 'since': '06 Sep 2023 18:19:08Z'}, 'base': {'name': 'ubuntu', 'channel': '22.04'}, 'constraints': 'arch=amd64 mem=3584M tags=juju,silo3', 'hardware': 'arch=amd64 cores=4 mem=24576M tags=virtual,pod-console-logging,rack1,juju,silo3,juju-30-1 availability-zone=zone3', 'controller-member-status': 'adding-vote'}, '2': {'juju-status': {'current': 'pending', 'since': '06 Sep 2023 18:19:08Z'}, 'dns-name': '10.246.166.202', 'ip-addresses': ['10.246.166.202'], 'instance-id': 'tt4c86', 'display-name': 'juju-29-1-7', 'machine-status': {'current': 'running', 'message': 'Deployed', 'since': '06 Sep 2023 18:23:44Z'}, 'modification-status': {'current': 'idle', 'since': '06 Sep 2023 18:19:08Z'}, 'base': {'name': 'ubuntu', 'channel': '22.04'}, 'constraints': 'arch=amd64 mem=3584M tags=juju,silo3', 'hardware': 'arch=amd64 cores=4 mem=24576M tags=virtual,pod-console-logging,rack1,juju,silo3,juju-29-1 availability-zone=zone2', 'controller-member-status': 'adding-vote'}}, 'applications': {'controller': {'charm': 'juju-controller', 'base': {'name': 'ubuntu', 'channel': '22.04'}, 'charm-origin': 'charmhub', 'charm-name': 'juju-controller', 'charm-rev': 14, 'charm-channel': '3.2/stable', 'exposed': False, 'application-status': {'current': 'waiting', 'message': 'waiting for machine', 'since': '06 Sep 2023 18:19:08Z'}, 'units': {'controller/0': {'workload-status': {'current': 'active', 'since': '06 Sep 2023 18:19:12Z'}, 'juju-status': {'current': 'idle', 'since': '06 Sep 2023 18:21:15Z', 'version': '3.2.3'}, 'leader': True, 'machine': '0', 'public-address': '10.246.165.84'}, 'controller/1': {'workload-status': {'current': 'waiting', 'message': 'waiting for machine', 'since': '06 Sep 2023 18:19:08Z'}, 'juju-status': {'current': 'allocating', 'since': '06 Sep 2023 18:19:08Z'}, 'machine': '1', 'public-address': '10.246.164.139'}, 'controller/2': {'workload-status': {'current': 'waiting', 'message': 'waiting for machine', 'since': '06 Sep 2023 18:19:08Z'}, 'juju-status': {'current': 'allocating', 'since': '06 Sep 2023 18:19:08Z'}, 'machine': '2', 'public-address': '10.246.166.202'}}, 'endpoint-bindings': {'': 'alpha', 'dashboard': 'alpha', 'website': 'alpha'}}}, 'storage': {}, 'controller': {'timestamp': '18:25:14Z'}}
2023-09-06-18:25:44 root DEBUG [localhost]: juju status -m foundations-maas:controller --format=yaml
2023-09-06-18:25:52 root ERROR [localhost] Command failed: juju status -m foundations-maas:controller --format=yaml
2023-09-06-18:25:52 root ERROR 1[localhost] STDOUT follows:
{}

2023-09-06-18:25:52 root ERROR 2[localhost] STDERR follows:
ERROR connection is shut down

Revision history for this message

John A Meinel (jameinel) wrote on 2023-09-07:

#10

So I believe the request from us was to retry a status during 'enable-ha'.

We can look at this, since it seems that normal `juju status` retries correctly, but `juju status --format=yaml` seems to include requests to poll additional data (IIRC relations and storage), and that seems to trigger this failure.

In the short term, we do still expect status to get interrupted while HA is being established, and this isn't a regression from 3.2.0

Canonical Juju

[juju 3.2-beta1] Connection to controller fails: ERROR connection is shut down

Bug Description

Other bug subscribers

Remote bug watches