Juju wait loses connection after test_auth_file_propagation

Bug #1895177 reported by Joshua Genet
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Charmed Kubernetes Testing
Fix Released
High
Cory Johns

Bug Description

After the test has completed and is waiting for non-leader master units to remove the line it wrote to /root/cdk/basic_auth.csv, the juju wait call raises an exception. You can see at the bottom of my output below that AllWatcher sends Next, then has to send a handful of Pings, then eventually AllWatcher sends a Stop.

Note: This isn't using the most recent 1.19 bits.

---

Here are 4 different runs that hit this, all in our AWS environment:
(logs/artifacts at the bottom of the page)

https://solutions.qa.canonical.com/kubernetes/testRun/7c529861-5ef8-4e58-8a98-8313e712f54e
https://solutions.qa.canonical.com/kubernetes/testRun/0e5079d7-2ac8-4ce4-bea7-f5681f56bd1a
https://solutions.qa.canonical.com/kubernetes/testRun/cf319220-270d-4d4b-8f5c-a8aba089a165
https://solutions.qa.canonical.com/kubernetes/testRun/229585fa-2f85-4a27-b046-e87d884be993

---

2020-09-10 07:55:46.095683 DEBUG juju.client.connection:connection.py:464 connection 139786068180216 <- {'request-id': 50, 'response': {'deltas': [['action', 'change', {'model-uuid': 'b8cca727-f32d-467e-8c0b-01f592fd90b3', 'id': '28', 'receiver': 'kubernetes-master/1', 'name': 'juju-run', 'parameters': {'command': "sed -i '$d' /root/cdk/basic_auth.csv", 'timeout': 0, 'workload-context': False}, 'status': 'completed', 'message': '', 'results': {'Code': '0'}, 'enqueued': '2020-09-10T07:55:06Z', 'started': '2020-09-10T07:55:06Z', 'completed': '2020-09-10T07:55:06Z'}]]}}
2020-09-10 07:55:46.095690 DEBUG juju.model:model.py:937 Model changed: action change 28
2020-09-10 07:55:46.095696 DEBUG juju.client.connection:connection.py:441 connection 139786068180216 -> {
2020-09-10 07:55:46.095701 "type": "AllWatcher",
2020-09-10 07:55:46.095707 "request": "Next",
2020-09-10 07:55:46.095712 "version": 1,
2020-09-10 07:55:46.095718 "params": {},
2020-09-10 07:55:46.095724 "Id": "1",
2020-09-10 07:55:46.095729 "request-id": 51
2020-09-10 07:55:46.095735 }
2020-09-10 07:55:46.095742 DEBUG juju.client.connection:connection.py:464 connection 139786068180216 <- {'request-id': 51, 'response': {'deltas': [['unit', 'change', {'model-uuid': 'b8cca727-f32d-467e-8c0b-01f592fd90b3', 'name': 'kubernetes-master/1', 'application': 'kubernetes-master', 'series': 'focal', 'charm-url': 'cs:~containers/kubernetes-master-888', 'life': 'alive', 'public-address': '100.24.48.202', 'private-address': '172.31.40.192', 'machine-id': '13', 'ports': [{'protocol': 'tcp', 'number': 6443}], 'port-ranges': [{'from-port': 6443, 'to-port': 6443, 'protocol': 'tcp'}], 'principal': '', 'subordinate': False, 'workload-status': {'current': 'active', 'message': 'Kubernetes master running.', 'since': '2020-09-10T07:51:33.628964419Z', 'version': ''}, 'agent-status': {'current': 'idle', 'message': '', 'since': '2020-09-10T07:55:05.754316937Z', 'version': '2.8.1'}}]]}}
2020-09-10 07:55:46.095748 DEBUG juju.model:model.py:937 Model changed: unit change kubernetes-master/1
2020-09-10 07:55:46.095754 DEBUG juju.client.connection:connection.py:441 connection 139786068180216 -> {
2020-09-10 07:55:46.095759 "type": "AllWatcher",
2020-09-10 07:55:46.095830 "request": "Next",
2020-09-10 07:55:46.095839 "version": 1,
2020-09-10 07:55:46.095845 "params": {},
2020-09-10 07:55:46.095851 "Id": "1",
2020-09-10 07:55:46.095856 "request-id": 52
2020-09-10 07:55:46.095862 }
2020-09-10 07:55:46.095875 DEBUG juju.client.connection:connection.py:464 connection 139786068180216 <- {'request-id': 52, 'response': {'deltas': [['unit', 'change', {'model-uuid': 'b8cca727-f32d-467e-8c0b-01f592fd90b3', 'name': 'kubernetes-master/1', 'application': 'kubernetes-master', 'series': 'focal', 'charm-url': 'cs:~containers/kubernetes-master-888', 'life': 'alive', 'public-address': '100.24.48.202', 'private-address': '172.31.40.192', 'machine-id': '13', 'ports': [{'protocol': 'tcp', 'number': 6443}], 'port-ranges': [{'from-port': 6443, 'to-port': 6443, 'protocol': 'tcp'}], 'principal': '', 'subordinate': False, 'workload-status': {'current': 'active', 'message': 'Kubernetes master running.', 'since': '2020-09-10T07:51:33.628964419Z', 'version': ''}, 'agent-status': {'current': 'executing', 'message': 'running commands', 'since': '2020-09-10T07:55:05.826065139Z', 'version': '2.8.1'}}]]}}
2020-09-10 07:55:46.095883 DEBUG juju.model:model.py:937 Model changed: unit change kubernetes-master/1
2020-09-10 07:55:46.095888 DEBUG juju.client.connection:connection.py:441 connection 139786068180216 -> {
2020-09-10 07:55:46.095894 "type": "AllWatcher",
2020-09-10 07:55:46.095900 "request": "Next",
2020-09-10 07:55:46.095906 "version": 1,
2020-09-10 07:55:46.095912 "params": {},
2020-09-10 07:55:46.095917 "Id": "1",
2020-09-10 07:55:46.095923 "request-id": 53
2020-09-10 07:55:46.095929 }
2020-09-10 07:55:46.095936 DEBUG juju.client.connection:connection.py:464 connection 139786068180216 <- {'request-id': 53, 'response': {'deltas': [['unit', 'change', {'model-uuid': 'b8cca727-f32d-467e-8c0b-01f592fd90b3', 'name': 'kubernetes-master/1', 'application': 'kubernetes-master', 'series': 'focal', 'charm-url': 'cs:~containers/kubernetes-master-888', 'life': 'alive', 'public-address': '100.24.48.202', 'private-address': '172.31.40.192', 'machine-id': '13', 'ports': [{'protocol': 'tcp', 'number': 6443}], 'port-ranges': [{'from-port': 6443, 'to-port': 6443, 'protocol': 'tcp'}], 'principal': '', 'subordinate': False, 'workload-status': {'current': 'active', 'message': 'Kubernetes master running.', 'since': '2020-09-10T07:51:33.628964419Z', 'version': ''}, 'agent-status': {'current': 'idle', 'message': '', 'since': '2020-09-10T07:55:05.877028788Z', 'version': '2.8.1'}}]]}}
2020-09-10 07:55:46.095942 DEBUG juju.model:model.py:937 Model changed: unit change kubernetes-master/1
2020-09-10 07:55:46.095948 DEBUG juju.client.connection:connection.py:441 connection 139786068180216 -> {
2020-09-10 07:55:46.095954 "type": "AllWatcher",
2020-09-10 07:55:46.095960 "request": "Next",
2020-09-10 07:55:46.096027 "version": 1,
2020-09-10 07:55:46.096034 "params": {},
2020-09-10 07:55:46.096040 "Id": "1",
2020-09-10 07:55:46.096046 "request-id": 54
2020-09-10 07:55:46.096051 }
2020-09-10 07:55:46.096058 DEBUG juju.client.connection:connection.py:441 connection 139786068180216 -> {
2020-09-10 07:55:46.096064 "type": "Pinger",
2020-09-10 07:55:46.096070 "request": "Ping",
2020-09-10 07:55:46.096075 "version": 1,
2020-09-10 07:55:46.096081 "params": {},
2020-09-10 07:55:46.096087 "request-id": 55
2020-09-10 07:55:46.096092 }
2020-09-10 07:55:46.096098 DEBUG juju.client.connection:connection.py:441 connection 139786068180216 -> {
2020-09-10 07:55:46.096104 "type": "Pinger",
2020-09-10 07:55:46.096109 "request": "Ping",
2020-09-10 07:55:46.096115 "version": 1,
2020-09-10 07:55:46.096121 "params": {},
2020-09-10 07:55:46.096126 "request-id": 56
2020-09-10 07:55:46.096154 }
2020-09-10 07:55:46.096161 DEBUG juju.client.connection:connection.py:464 connection 139786068180216 <- {'request-id': 56, 'response': {}}
2020-09-10 07:55:46.096167 DEBUG juju.client.connection:connection.py:464 connection 139786068180216 <- {'request-id': 55, 'response': {}}
2020-09-10 07:55:46.096173 DEBUG juju.client.connection:connection.py:441 connection 139786068180216 -> {
2020-09-10 07:55:46.096179 "type": "Pinger",
2020-09-10 07:55:46.096184 "request": "Ping",
2020-09-10 07:55:46.096190 "version": 1,
2020-09-10 07:55:46.096196 "params": {},
2020-09-10 07:55:46.096201 "request-id": 57
2020-09-10 07:55:46.096207 }
2020-09-10 07:55:46.096212 DEBUG juju.client.connection:connection.py:441 connection 139786068180216 -> {
2020-09-10 07:55:46.096218 "type": "Pinger",
2020-09-10 07:55:46.096224 "request": "Ping",
2020-09-10 07:55:46.096229 "version": 1,
2020-09-10 07:55:46.096235 "params": {},
2020-09-10 07:55:46.096240 "request-id": 58
2020-09-10 07:55:46.096246 }
2020-09-10 07:55:46.096254 DEBUG juju.client.connection:connection.py:464 connection 139786068180216 <- {'request-id': 57, 'response': {}}
2020-09-10 07:55:46.096260 DEBUG juju.client.connection:connection.py:464 connection 139786068180216 <- {'request-id': 58, 'response': {}}
2020-09-10 07:55:46.096277 DEBUG juju.client.connection:connection.py:464 connection 139786068180216 <- {'request-id': 54, 'response': {'deltas': [['unit', 'change', {'model-uuid': 'b8cca727-f32d-467e-8c0b-01f592fd90b3', 'name': 'kubernetes-master/0', 'application': 'kubernetes-master', 'series': 'focal', 'charm-url': 'cs:~containers/kubernetes-master-888', 'life': 'alive', 'public-address': '54.80.173.22', 'private-address': '172.31.41.213', 'machine-id': '12', 'ports': [{'protocol': 'tcp', 'number': 6443}], 'port-ranges': [{'from-port': 6443, 'to-port': 6443, 'protocol': 'tcp'}], 'principal': '', 'subordinate': False, 'workload-status': {'current': 'active', 'message': 'Kubernetes master running.', 'since': '2020-09-10T07:49:41.394829415Z', 'version': ''}, 'agent-status': {'current': 'idle', 'message': '', 'since': '2020-09-10T07:55:25.285778737Z', 'version': '2.8.1'}}]]}}
2020-09-10 07:55:46.096285 DEBUG juju.model:model.py:937 Model changed: unit change kubernetes-master/0
2020-09-10 07:55:46.096290 DEBUG juju.client.connection:connection.py:441 connection 139786068180216 -> {
2020-09-10 07:55:46.096296 "type": "AllWatcher",
2020-09-10 07:55:46.096302 "request": "Next",
2020-09-10 07:55:46.096307 "version": 1,
2020-09-10 07:55:46.096313 "params": {},
2020-09-10 07:55:46.096318 "Id": "1",
2020-09-10 07:55:46.096324 "request-id": 59
2020-09-10 07:55:46.096330 }
2020-09-10 07:55:46.096335 DEBUG juju.client.connection:connection.py:441 connection 139786068180216 -> {
2020-09-10 07:55:46.096341 "type": "Pinger",
2020-09-10 07:55:46.096347 "request": "Ping",
2020-09-10 07:55:46.096352 "version": 1,
2020-09-10 07:55:46.096358 "params": {},
2020-09-10 07:55:46.096363 "request-id": 60
2020-09-10 07:55:46.096369 }
2020-09-10 07:55:46.096374 DEBUG juju.client.connection:connection.py:441 connection 139786068180216 -> {
2020-09-10 07:55:46.096380 "type": "Pinger",
2020-09-10 07:55:46.096386 "request": "Ping",
2020-09-10 07:55:46.096391 "version": 1,
2020-09-10 07:55:46.096397 "params": {},
2020-09-10 07:55:46.096402 "request-id": 61
2020-09-10 07:55:46.096411 }
2020-09-10 07:55:46.096417 DEBUG juju.client.connection:connection.py:464 connection 139786068180216 <- {'request-id': 61, 'response': {}}
2020-09-10 07:55:46.096423 DEBUG juju.client.connection:connection.py:464 connection 139786068180216 <- {'request-id': 60, 'response': {}}
2020-09-10 07:55:46.096429 DEBUG juju.client.connection:connection.py:441 connection 139786068180216 -> {
2020-09-10 07:55:46.096434 "type": "Pinger",
2020-09-10 07:55:46.096440 "request": "Ping",
2020-09-10 07:55:46.096446 "version": 1,
2020-09-10 07:55:46.096451 "params": {},
2020-09-10 07:55:46.096457 "request-id": 62
2020-09-10 07:55:46.096463 }
2020-09-10 07:55:46.096470 DEBUG juju.client.connection:connection.py:441 connection 139786068180216 -> {
2020-09-10 07:55:46.096476 "type": "Pinger",
2020-09-10 07:55:46.096481 "request": "Ping",
2020-09-10 07:55:46.096487 "version": 1,
2020-09-10 07:55:46.096493 "params": {},
2020-09-10 07:55:46.096572 "request-id": 63
2020-09-10 07:55:46.096580 }
2020-09-10 07:55:46.096586 DEBUG juju.client.connection:connection.py:464 connection 139786068180216 <- {'request-id': 62, 'response': {}}
2020-09-10 07:55:46.096592 DEBUG juju.client.connection:connection.py:464 connection 139786068180216 <- {'request-id': 63, 'response': {}}
2020-09-10 07:55:46.096598 ---------------------------- Captured log teardown -----------------------------
2020-09-10 07:55:46.096604 DEBUG juju.model:model.py:613 Stopping watcher task
2020-09-10 07:55:46.096610 DEBUG juju.client.connection:connection.py:441 connection 139786068180216 -> {
2020-09-10 07:55:46.096616 "type": "AllWatcher",
2020-09-10 07:55:46.096621 "request": "Stop",
2020-09-10 07:55:46.096627 "version": 1,
2020-09-10 07:55:46.096633 "params": {},
2020-09-10 07:55:46.096639 "Id": "1",
2020-09-10 07:55:46.096645 "request-id": 64
2020-09-10 07:55:46.096650 }
2020-09-10 07:55:46.096657 DEBUG juju.client.connection:connection.py:464 connection 139786068180216 <- {'request-id': 64, 'response': {}}
2020-09-10 07:55:46.096662 DEBUG juju.model:model.py:619 Closing model connection

Revision history for this message
Alexander Balderson (asbalderson) wrote :

Adding the release blocker tag, this is preventing all runs from completing validation

tags: added: cdo-release-blocker
Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

I don't think the actual test is an issue here since "test_auth_file_propogation" just happens to be the first test [1] we run. I'm guessing this has more to do with juju-wait. Can you confirm this is only affecting AWS?

1: https://github.com/charmed-kubernetes/jenkins/blob/master/jobs/integration/validation.py#L194

Changed in charmed-kubernetes-testing:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Adam Stokes (adam-stokes) wrote :

This is not something we've run into with our testing. What version of juju is this?

Revision history for this message
Joshua Genet (genet022) wrote :

We've only seen this on AWS so far. Alex is working on testing other substrates.
You can see versions of the software we're using for each run in my links. These are all using juju_2.8.1.

It could be an issue with the way the k8s tests call juju wait, but it also could be an issue with juju-wait itself.

Revision history for this message
Cory Johns (johnsca) wrote :

The DEBUG log messages in this report are not from juju-wait, they're from libjuju for the model connection that the test uses and the closed connection is actually a clean shutdown after the test fails due to juju-wait returning a non-zero exit code. The actual exception, which includes both stdout and stderr from juju-wait (or, more specifically, the lack thereof) can be found in the fce_build_7803_console.out file of the artifacts:

2020-09-10 07:55:45.842021 __________________________ test_auth_file_propagation __________________________
2020-09-10 07:55:45.842027 Traceback (most recent call last):
2020-09-10 07:55:45.842034 File "/home/ubuntu/k8s-validation/jobs/integration/validation.py", line 226, in test_auth_file_propagation
2020-09-10 07:55:45.842040 await tools.juju_wait()
2020-09-10 07:55:45.842050 File "/home/ubuntu/k8s-validation/jobs/integration/conftest.py", line 128, in juju_wait
2020-09-10 07:55:45.842056 return await self.run(cmd)
2020-09-10 07:55:45.842068 File "/home/ubuntu/k8s-validation/jobs/integration/conftest.py", line 117, in run
2020-09-10 07:55:45.842075 f"Problem with run command [{' '.join(cmd)}]: \nstdout: "
2020-09-10 07:55:45.842081 Exception: Problem with run command [juju wait -e foundations-aws:kubernetes -w]:
2020-09-10 07:55:45.842087 stdout:
2020-09-10 07:55:45.842093 stderr:

The relevant code can be found at: https://github.com/charmed-kubernetes/jenkins/blob/7b3b43371b4fc3330b97d3a0917badccb1f8fd6a/jobs/integration/conftest.py#L106-L120

Note that this means that juju-wait is returning a non-zero exit code but not providing *any* output on either stdout or stderr, and the arguments passed to juju-wait seem entirely normal.

Revision history for this message
Cory Johns (johnsca) wrote :

Since we have never encountered this and I'm not even sure how juju-wait could possibly fail without any output whatsoever, we're stuck on how to get anywhere on this. Would it be possible to have someone manually run juju-wait in one of those environments to try to get more info on why or how it's failing?

Revision history for this message
Alexander Balderson (asbalderson) wrote :

To catch up with the conversion had on IRC

We're going to test using the snap juju-wait instead of the pypi juju-wait which is a few versions behind the snap.

Since FCE runs juju-wait a few times before reaching the validation phase, it seems likely to work in this case as well.

Revision history for this message
Alexander Balderson (asbalderson) wrote :

After updating to the newer hash which uses the snap juju-wait we get the error:

Traceback (most recent call last): File "/home/ubuntu/k8s-validation/jobs/integration/validation.py", line 227, in test_auth_file_propagation await tools.juju_wait() File "/home/ubuntu/k8s-validation/jobs/integration/conftest.py", line 128, in juju_wait return await self.run(cmd) File "/home/ubuntu/k8s-validation/jobs/integration/conftest.py", line 117, in run f"Problem with run command [{' '.join(cmd)}]: \nstdout: " Exception: Problem with run command [juju wait -e foundations-aws:kubernetes -w]: stdout: stderr:

David Britton (dpb)
summary: - Juju wait losses connection after test_auth_file_propagation
+ Juju wait loses connection after test_auth_file_propagation
Revision history for this message
Cory Johns (johnsca) wrote :

George and I finally tracked this down to the uvloop alternative EventLoop implementation that the validation suite was using. Reverting that back to the standard asyncio.EventLoop fixed the issue in the FCE environment for us, so it's now been changed on commit ae6e6d4 (current master) of the CK Jenkins repo.

Changed in charmed-kubernetes-testing:
assignee: nobody → Cory Johns (johnsca)
status: Triaged → In Progress
milestone: none → 1.19
status: In Progress → Fix Committed
Revision history for this message
Alexander Balderson (asbalderson) wrote :

The commit from ae6e6d4 seems to have fixed this issue and has been picked up for the release test

Once there has been a few hash's on this hash then the 1.19 tag should probably be updated.

tags: removed: cdo-release-blocker
Changed in charmed-kubernetes-testing:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.