[2.9.23 candidate] connection is shut down

Bug #1957824 reported by Marian Gasparovic
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

While testing the 2.9.23 candidate we encountered two failures during a charm's install hook which we had not seen before. Both failed with "ERROR connection is shut down".

Both failures occurred when calling Juju hook tools:

subprocess.CalledProcessError: Command '['config-get', '--all', '--format=json']' returned non-zero exit status 1.

and

subprocess.CalledProcessError: Command '['leader-get', '--format=json', 'rndc_key']' returned non-zero exit status 1.
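
Both commands are standard Juju hook tools that charm code invokes as subprocesses from inside a hook. A minimal sketch of that call pattern, for readers unfamiliar with it (the run_hook_tool wrapper is illustrative, not the failing charm's code; config-get and leader-get are the real tools from the errors above):

import json
import subprocess

def run_hook_tool(*args):
    # Invoke a Juju hook tool and parse its JSON output. When the unit
    # agent's API connection is shut down, the tool exits non-zero and
    # subprocess raises CalledProcessError -- the failure seen above.
    output = subprocess.check_output(list(args)).decode("UTF-8")
    return json.loads(output)

# Only works inside a running hook, where the hook tools are on PATH.
config = run_hook_tool("config-get", "--all", "--format=json")
rndc_key = run_hook_tool("leader-get", "--format=json", "rndc_key")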

Links to artifacts

https://oil-jenkins.canonical.com/artifacts/2e001c7d-94fb-44b3-9716-505e98f11178/index.html

and

https://oil-jenkins.canonical.com/artifacts/5fc83154-f905-4065-8c8c-5ae2fccbcb23/index.html

Revision history for this message
John A Meinel (jameinel) wrote :

This is happening for charms, but https://github.com/juju/python-libjuju/issues/615 is something similar seen for clients (running pylibjuju). It is plausible that it isn't the same thing at all; it is just interesting to see similar behavior.

Revision history for this message
Simon Richardson (simonrichardson) wrote :

Fixes a panic: https://github.com/juju/juju/pull/13622. This might not fix the connection shut-down, but it is a panic in the logs nevertheless.

Do we know why we're performing an engine report during the tests that might cause a restart of units?

Revision history for this message
Alexander Balderson (asbalderson) wrote :

Simon, we run the juju engine report as part of crashdump collection after the run has failed, and it shouldn't be run until after we've detected a failure (usually when juju-wait detects an error and stops).

Also all the instances of this bug can be found at:
https://solutions.qa.canonical.com/bugs/bugs/bug/1957824

Revision history for this message
Alexander Balderson (asbalderson) wrote :

SQA hit this another half dozen times over the last day of testing; adding the release-blocker tag.

tags: added: cdo-release-blocker
Revision history for this message
Ian Booth (wallyworld) wrote :

Is this still happening in the 2.9.24 candidate?

Changed in juju:
status: New → Incomplete
Changed in juju:
status: Incomplete → Triaged
importance: Undecided → High
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
milestone: none → 2.9.25
Revision history for this message
Yang Kelvin Liu (kelvin.liu) wrote :

I found this happens after a mongo SYNC. After discussing with Ian, we think this issue might be the same as (or similar to) one we got in prodstack (a timeout when mongo switches primary).
These failures mostly happen on non-public clouds (probably low-IOPS disks).
I would suggest another run with the logging level set to WARNING.
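
As a concrete sketch of that suggestion (logging-config is the real Juju model setting; the model name below is a placeholder, and driving the CLI from Python just matches the rest of this report):

import subprocess

# Raise the root logging level to WARNING for the next run.
# "my-model" is a placeholder; point this at the model under test.
subprocess.check_call([
    "juju", "model-config", "-m", "my-model",
    "logging-config=<root>=WARNING",
])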

Revision history for this message
Alexander Balderson (asbalderson) wrote :

I can provide some info about what we're running the Juju controllers on where we see this.

The controllers are in KVMs with 24G memory, 4 cores, and 50G disk. The machines hosting the KVMs (3 per machine) are also running 3 Vault KVMs (4G memory, 2 cores, 40G disk).

The hosts have 2 240G SSDs, 64G memory, and a 4-core processor (3.4GHz). We over-commit memory by 2x and CPU by 5x. I don't feel like we're pushing the machines to their limits with the over-commit.

Changed in juju:
milestone: 2.9.25 → 2.9.26
Changed in juju:
milestone: 2.9.26 → 2.9.27
Changed in juju:
milestone: 2.9.27 → 2.9.28
Changed in juju:
milestone: 2.9.28 → 2.9.29
Changed in juju:
milestone: 2.9.29 → 2.9.30
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.9.30 → 2.9-next
assignee: Yang Kelvin Liu (kelvin.liu) → nobody
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.9-next → none
Revision history for this message
Marian Gasparovic (marosg) wrote :

We hit this again with 3.3 stable:

  File "/var/lib/juju/agents/unit-kubeapi-load-balancer-0/charm/reactive/nginx.py", line 11, in <module>
    config = hookenv.config()
  File "/var/lib/juju/agents/unit-kubeapi-load-balancer-0/.venv/lib/python3.8/site-packages/charmhelpers/core/hookenv.py", line 444, in config
    subprocess.check_output(config_cmd_line).decode('UTF-8'))
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['config-get', '--all', '--format=json']' returned non-zero exit status 1.

Artifacts - https://oil-jenkins.canonical.com/artifacts/76af2731-6203-427a-9cd0-3db3071af831/index.html
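
Since the failure looks transient (the unit agent's API connection is shut down briefly, e.g. while mongo switches primary), a hedged workaround sketch is to retry hookenv.config() before giving up. config_with_retry, the attempt count, and the delay are invented for illustration; charmhelpers.core.hookenv.config() is the real call from the traceback above:

import time
from subprocess import CalledProcessError

from charmhelpers.core import hookenv

def config_with_retry(attempts=3, delay=5):
    # Return the charm config, retrying transient hook-tool failures.
    for attempt in range(1, attempts + 1):
        try:
            return hookenv.config()
        except CalledProcessError:
            if attempt == attempts:
                raise  # still failing after the last attempt
            time.sleep(delay)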
