[juju 3.2-beta1] Connection to controller fails: ERROR connection is shut down

Bug #2015371 reported by Bas de Bruijne
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Joseph Phillips

Bug Description

In testrun https://solutions.qa.canonical.com/v2/testruns/07f77652-f841-45c4-afdd-6bfdc73b394f, we bootstrap a MAAS controller, call `juju enable-ha -c foundations-maas`, and then watch `juju status -m foundations-maas:controller --format=yaml` until the controller units are ready. After a while, the `juju status` command fails:

==========================
2023-04-05-04:59:05 root DEBUG [localhost]: juju status -m foundations-maas:controller --format=yaml
2023-04-05-04:59:14 root ERROR [localhost] Command failed: juju status -m foundations-maas:controller --format=yaml
2023-04-05-04:59:14 root ERROR [localhost] STDOUT follows:
{}

2023-04-05-04:59:14 root ERROR [localhost] STDERR follows:
ERROR connection is shut down
==========================
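
For reference, the watch step described above is roughly equivalent to the following (a minimal Python sketch rather than our actual harness code; the model name comes from the run above, and the interval, timeout, and readiness check are illustrative assumptions):

==========================
import subprocess
import time

import yaml  # PyYAML, assumed available in the test environment

MODEL = "foundations-maas:controller"

def wait_for_controller_units(timeout=1800, interval=30):
    """Poll `juju status` until every controller unit reports an active workload."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        proc = subprocess.run(
            ["juju", "status", "-m", MODEL, "--format=yaml"],
            capture_output=True, text=True,
        )
        if proc.returncode != 0:
            # This is where the run above falls over with
            # "ERROR connection is shut down" on stderr.
            raise RuntimeError(proc.stderr.strip())
        status = yaml.safe_load(proc.stdout)
        units = status["applications"]["controller"]["units"]
        if all(u["workload-status"]["current"] == "active" for u in units.values()):
            return
        time.sleep(interval)
    raise TimeoutError("controller units did not become ready in time")
==========================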

The controller logs (https://oil-jenkins.canonical.com/artifacts/07f77652-f841-45c4-afdd-6bfdc73b394f/generated/generated/juju_maas_controller/juju-crashdump-controller-2023-04-05-05.00.05.tar.gz) show a websocket issue:

==========================
2023-04-05 04:59:14 INFO juju.worker.uniter uniter.go:338 unit "controller/0" shutting down: catacomb 0xc0028db200 is dying
2023-04-05 04:59:14 INFO juju.worker.logger logger.go:136 logger worker stopped
2023-04-05 04:59:14 DEBUG juju.worker.dependency engine.go:618 "hook-retry-strategy" manifold worker stopped: watcher has been stopped (stopped)
stack trace:
watcher has been stopped (stopped)
github.com/juju/juju/rpc.(*Conn).Call:178:
github.com/juju/juju/api.(*state).APICall:1266:
2023-04-05 04:59:14 ERROR juju.worker.dependency engine.go:695 "hook-retry-strategy" manifold worker returned unexpected error: watcher has been stopped (stopped)
2023-04-05 04:59:14 DEBUG juju.worker.dependency engine.go:618 "migration-inactive-flag" manifold worker stopped: watcher has been stopped (stopped)
stack trace:
watcher has been stopped (stopped)
github.com/juju/juju/rpc.(*Conn).Call:178:
github.com/juju/juju/api.(*state).APICall:1266:
2023-04-05 04:59:14 ERROR juju.worker.dependency engine.go:695 "migration-inactive-flag" manifold worker returned unexpected error: watcher has been stopped (stopped)
2023-04-05 04:59:14 DEBUG juju.worker.dependency engine.go:618 "meter-status" manifold worker stopped: watcher has been stopped (stopped)
stack trace:
watcher has been stopped (stopped)
github.com/juju/juju/rpc.(*Conn).Call:178:
github.com/juju/juju/api.(*state).APICall:1266:
2023-04-05 04:59:14 DEBUG juju.worker.dependency engine.go:618 "log-sender" manifold worker stopped: sending log message: websocket: close 1006 (abnormal closure): unexpected EOF: use of closed network connection
stack trace:
use of closed network connection
github.com/juju/juju/api.(*DeadlineStream).WriteJSON:93:
github.com/juju/juju/api/logsender.(*writer).WriteLog:97: websocket: close 1006 (abnormal closure): unexpected EOF
github.com/juju/juju/api/logsender.(*writer).WriteLog:99: sending log message
github.com/juju/juju/worker/logsender.New.func1:69:
2023-04-05 04:59:14 ERROR juju.worker.dependency engine.go:695 "log-sender" manifold worker returned unexpected error: sending log message: websocket: close 1006 (abnormal closure): unexpected EOF: use of closed network connection
2023-04-05 04:59:14 DEBUG juju.worker.dependency engine.go:618 "uniter" manifold worker stopped: watcher has been stopped (stopped)
==========================

Crashdumps and configs can be found here:
https://oil-jenkins.canonical.com/artifacts/07f77652-f841-45c4-afdd-6bfdc73b394f/index.html

Revision history for this message
Joseph Phillips (manadart) wrote (last edit):

The process of establishing HA has changed.

What you're seeing will be due to the initial controller bouncing for Dqlite node rebinding while the other controllers are also in the process of joining the cluster - there is a brief interval where the API will be inaccessible.

Changed in juju:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Joseph Phillips (manadart)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Joseph Phillips (manadart) wrote :

This might mitigate the downtime, but there's no way to avoid some churn during the establishment of HA.

https://github.com/juju/juju/pull/15468

It is likely that we'll have to adjust the test case expectations.

Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
milestone: none → 3.2-beta2
Revision history for this message
John A Meinel (jameinel) wrote :

Note that Solutions QA is seeing failures of this sort during their test runs (in their testing of 3.2-beta1 they reported hitting an error of this sort three times during their test suite).

So we need to figure out how to preserve "CI doesn't fail during 'juju status' after enable-ha" in some fashion. Otherwise clean test runs for OpenStack/CK/Kubeflow etc. are all going to run into this, even if our own CI job is fixed.

If we have a clean pattern for how things should be tested, we can work with Solutions QA on that, but we need a clear "this is how we go into HA and how you test against it".

Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

We can update our CI to treat the controller being unreachable during enable-ha as expected behaviour. What is the maximum amount of time that you expect the controller to be unreachable? We currently check the status every 30 seconds; given the number of occurrences of this bug we have seen, I expect it is not much longer than that.
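
Concretely, the change I have in mind on our side looks something like this (a sketch only; the 120-second unreachability budget and the three-controller assumption are placeholders until we know the expected maximum downtime):

==========================
import subprocess
import time

import yaml  # PyYAML

def wait_for_ha(model="foundations-maas:controller", expected_controllers=3,
                interval=30, unreachable_budget=120, timeout=1800):
    """Poll `juju status`, treating 'connection is shut down' as expected
    for a bounded period while enable-ha bounces the API servers."""
    deadline = time.monotonic() + timeout
    unreachable_since = None
    while time.monotonic() < deadline:
        proc = subprocess.run(
            ["juju", "status", "-m", model, "--format=yaml"],
            capture_output=True, text=True,
        )
        if proc.returncode == 0:
            unreachable_since = None
            machines = yaml.safe_load(proc.stdout).get("machines", {})
            voters = [m for m in machines.values()
                      if m.get("controller-member-status") == "has-vote"]
            if len(voters) >= expected_controllers:
                return machines
        else:
            # "ERROR connection is shut down" lands here; tolerate it briefly.
            unreachable_since = unreachable_since or time.monotonic()
            if time.monotonic() - unreachable_since > unreachable_budget:
                raise RuntimeError(
                    "controller unreachable for more than "
                    f"{unreachable_budget}s: {proc.stderr.strip()}")
        time.sleep(interval)
    raise TimeoutError("HA was not established within the timeout")
==========================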

Revision history for this message
Joseph Phillips (manadart) wrote :

We discovered another avenue for improving performance here.

https://github.com/juju/juju/pull/15619

This should take some 30 seconds off the time it takes to establish HA.

Changed in juju:
milestone: 3.2-beta2 → 3.2.1
Revision history for this message
Jeffrey Chang (modern911) wrote :

I found a similar error in a different run, at a different stage: https://solutions.qa.canonical.com/testruns/91506eb0-8447-41fa-90fe-c7a5898d64f2.
This run was on AWS, and one of the EC2 instances (machine 14) lost its connection to the juju controller, but we were still able to collect logs on that instance.

2023-06-20 20:28:43 DEBUG jujuc server.go:213 running hook tool "status-set" for vault/0-etcd-relation-created-4597090498041969428
2023-06-20 20:31:14 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:32:24 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:31:14 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:31:00 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:31:19 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:35:44 DEBUG juju.worker.leadership tracker.go:227 filebeat/8 waiting for filebeat leadership release gave err: error blocking on leadership release: waiting for leadership cancelled by client
2023-06-20 20:35:44 DEBUG juju.rpc server.go:328 error closing codec: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 172.31.33.62:43424->172.31.33.161:17070: write: broken pipe
2023-06-20 20:31:09 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:35:44 DEBUG juju.rpc server.go:328 error closing codec: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 172.31.33.62:43362->172.31.33.161:17070: write: broken pipe
2023-06-20 20:35:44 DEBUG juju.worker.dependency engine.go:618 "migration-inactive-flag" manifold worker stopped: watcher has been stopped (stopped)
stack trace:
watcher has been stopped (stopped)
github.com/juju/juju/rpc.(*Conn).Call:178:
github.com/juju/juju/api.(*state).APICall:1251:
2023-06-20 20:35:44 ERROR juju.worker.dependency engine.go:695 "migration-inactive-flag" manifold worker returned unexpected error: watcher has been stopped (stopped)
2023-06-20 20:35:44 ERROR juju.api monitor.go:59 health ping timed out after 30s
2023-06-20 20:35:44 DEBUG juju.rpc server.go:328 error closing codec: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 172.31.33.62:41852->54.224.157.225:17070: write: broken pipe
2023-06-20 20:35:44 ERROR juju.worker.uniter.metrics listener.go:52 failed to close the collect-metrics listener: close unix /var/lib/juju/agents/unit-telegraf-8/4012084279/s: use of closed network connection
2023-06-20 20:35:44 ERROR juju.worker.uniter.metrics listener.go:52 failed to close the collect-metrics listener: close unix /var/lib/juju/agents/unit-ubuntu-advantage-8/911094120/s: use of closed network connection
2023-06-20 20:35:44 ERROR juju.api.watcher watcher.go:95 error trying to stop watcher: write tcp 172.31.33.62:43384->172.31.33.161:17070: write: broken pipe
2023-06-20 20:35:44 DEBUG juju.rpc server.go:328 error closing codec: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 172.31.33.62:33654->54.224.157.225:17070: write: broken pipe
2023-06-20 20:35:44 DEBUG juju.worker.dependency engine.go:618 "valid-credential-flag" manifold worker stop...


John A Meinel (jameinel)
Changed in juju:
milestone: 3.2.1 → 3.2.2
Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

I am still seeing this in the 3.2.2 release testing (test run https://solutions.qa.canonical.com/testruns/29919bf8-9c75-4501-a42c-8d801387e34c/):

===========
2023-08-08-08:35:22 root DEBUG [localhost]: juju status -m foundations-maas:controller --format=yaml
2023-08-08-08:35:27 root ERROR [localhost] Command failed: juju status -m foundations-maas:controller --format=yaml
2023-08-08-08:35:27 root ERROR [localhost] STDOUT follows:
{}

2023-08-08-08:35:27 root ERROR [localhost] STDERR follows:
ERROR connection is shut down
===========

I can investigate later why this is happening. Are there any additional logs that we can provide?

Revision history for this message
Joseph Phillips (manadart) wrote :

We are working with the Dqlite guys to work out why establishment of a cluster is taking longer than expected.

In the interim, this may help a little:
https://github.com/juju/juju/pull/16129

Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Jeffrey Chang (modern911) wrote :

We saw this with the juju 3.2.3 candidate (https://solutions.qa.canonical.com/testruns/a09bb7aa-342c-4470-b77e-670554f134c5).
It happened during HA setup.

2023-09-06-18:25:14 root DEBUG [localhost]: juju status -m foundations-maas:controller --format=yaml
2023-09-06-18:25:14 fce.juju_maas_controller DEBUG controller status: {'model': {'name': 'controller', 'type': 'iaas', 'controller': 'foundations-maas', 'cloud': 'maas_cloud', 'region': 'default', 'version': '3.2.3', 'model-status': {'current': 'available', 'since': '06 Sep 2023 18:19:00Z'}, 'sla': 'unsupported'}, 'machines': {'0': {'juju-status': {'current': 'started', 'since': '06 Sep 2023 18:19:06Z', 'version': '3.2.3'}, 'hostname': 'juju-28-1-6', 'dns-name': '10.246.165.84', 'ip-addresses': ['10.246.165.84'], 'instance-id': '8payah', 'display-name': 'juju-28-1-6', 'machine-status': {'current': 'running', 'message': 'Deployed', 'since': '06 Sep 2023 18:19:32Z'}, 'modification-status': {'current': 'idle', 'since': '06 Sep 2023 18:19:01Z'}, 'base': {'name': 'ubuntu', 'channel': '22.04'}, 'network-interfaces': {'ens4': {'ip-addresses': ['10.246.165.84'], 'mac-address': '52:54:00:3d:11:28', 'gateway': '10.246.164.1', 'space': 'oam-space', 'is-up': True}}, 'constraints': 'arch=amd64 mem=3584M tags=juju,silo3', 'hardware': 'arch=amd64 cores=4 mem=24576M tags=virtual,pod-console-logging,rack1,juju,silo3,juju-28-1 availability-zone=zone1', 'controller-member-status': 'has-vote', 'ha-primary': True}, '1': {'juju-status': {'current': 'pending', 'since': '06 Sep 2023 18:19:08Z'}, 'dns-name': '10.246.164.139', 'ip-addresses': ['10.246.164.139'], 'instance-id': 'b3wega', 'display-name': 'juju-30-1-8', 'machine-status': {'current': 'running', 'message': 'Deployed', 'since': '06 Sep 2023 18:23:44Z'}, 'modification-status': {'current': 'idle', 'since': '06 Sep 2023 18:19:08Z'}, 'base': {'name': 'ubuntu', 'channel': '22.04'}, 'constraints': 'arch=amd64 mem=3584M tags=juju,silo3', 'hardware': 'arch=amd64 cores=4 mem=24576M tags=virtual,pod-console-logging,rack1,juju,silo3,juju-30-1 availability-zone=zone3', 'controller-member-status': 'adding-vote'}, '2': {'juju-status': {'current': 'pending', 'since': '06 Sep 2023 18:19:08Z'}, 'dns-name': '10.246.166.202', 'ip-addresses': ['10.246.166.202'], 'instance-id': 'tt4c86', 'display-name': 'juju-29-1-7', 'machine-status': {'current': 'running', 'message': 'Deployed', 'since': '06 Sep 2023 18:23:44Z'}, 'modification-status': {'current': 'idle', 'since': '06 Sep 2023 18:19:08Z'}, 'base': {'name': 'ubuntu', 'channel': '22.04'}, 'constraints': 'arch=amd64 mem=3584M tags=juju,silo3', 'hardware': 'arch=amd64 cores=4 mem=24576M tags=virtual,pod-console-logging,rack1,juju,silo3,juju-29-1 availability-zone=zone2', 'controller-member-status': 'adding-vote'}}, 'applications': {'controller': {'charm': 'juju-controller', 'base': {'name': 'ubuntu', 'channel': '22.04'}, 'charm-origin': 'charmhub', 'charm-name': 'juju-controller', 'charm-rev': 14, 'charm-channel': '3.2/stable', 'exposed': False, 'application-status': {'current': 'waiting', 'message': 'waiting for machine', 'since': '06 Sep 2023 18:19:08Z'}, 'units': {'controller/0': {'workload-status': {'current': 'active', 'since': ...


Revision history for this message
John A Meinel (jameinel) wrote :

So I believe the request from us was to retry a status during 'enable-ha'.

We can look at this, since it seems that normal `juju status` retries correctly, but `juju status --format=yaml` seems to include requests to poll additional data (IIRC relations and storage), and that seems to trigger this failure.

In the short term, we do still expect status to get interrupted while HA is being established, and this isn't a regression from 3.2.0.
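
To illustrate the kind of client-side retry we mean, something like the following would paper over the window (an example only, not behaviour Juju itself provides; the attempt count and delay are arbitrary):

==========================
import subprocess
import time

def status_with_retry(model="foundations-maas:controller", attempts=5, delay=15):
    """Run `juju status --format=yaml`, retrying while the controller API is
    briefly unavailable during enable-ha."""
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(
            ["juju", "status", "-m", model, "--format=yaml"],
            capture_output=True, text=True,
        )
        if proc.returncode == 0:
            return proc.stdout
        # Only the known transient error is retried; anything else fails fast.
        if "connection is shut down" not in proc.stderr or attempt == attempts:
            raise RuntimeError(proc.stderr.strip())
        time.sleep(delay)
==========================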
