juju-status error from controller model

Bug #1895701 reported by John George
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Expired
High
Unassigned

Bug Description

SolutionsQA tests failed when calling juju status on the controller model just after bootstrap.
We can see from the bootstrap output that the controller reported itself available:

Bootstrap agent now started
Contacting Juju controller at 10.244.32.115 to verify accessibility...

Bootstrap complete, controller "foundation-openstack" is now available
Controller machines are in the "controller" model
Initial model "default" added

2020-09-15-09:28:40 root ERROR [localhost] Command failed: juju status -m foundation-openstack:controller --format=yaml
2020-09-15-09:28:40 root ERROR [localhost] STDOUT follows:
{}

subprocess.CalledProcessError: Command '['juju', 'status', '-m', 'foundation-openstack:controller', '--format=yaml']' returned non-zero exit status 1.

The test harness subsequently called juju status against the same controller without it raising an error, so possibly a timing issue?

We saw this on two runs:
https://solutions.qa.canonical.com/openstack/testRun/824fe1ed-1181-42c9-9f93-bcc2bafce127
https://solutions.qa.canonical.com/openstack/testRun/ea75d9af-5dae-4cb1-b80d-b68924eca1e7

Revision history for this message
Pen Gale (pengale) wrote :

I tried to recreate this by bootstrapping a controller and quickly calling status on it, and wasn't able to do so.

I suspect that the error is local to the machine executing Juju run. Can you capture stdout from Python's subprocess model? (You can also get more information about the local run by passing --debug to the juju run command.)

Changed in juju:
status: New → Incomplete
Revision history for this message
Pen Gale (pengale) wrote :

Whoops. I mean "juju status" in the above. (Had two bugs open at once.)

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

The stdout is there:

2020-09-15-09:28:40 root ERROR [localhost] STDOUT follows:
{}

Michael Skalka (mskalka)
Changed in juju:
status: Incomplete → New
Revision history for this message
Pen Gale (pengale) wrote :

Triaging as high and adding to the next cycle's milestone. It's still unclear why this is happening, as we haven't been able to reproduce in a Juju facing test environment. But I'd like to see us make another go at it, as part of the push to clean up intermittent errors in Solutions QA.

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 3.0.0
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Sub'd to field high as this is one of our most common juju failure modes.

Revision history for this message
Michael Skalka (mskalka) wrote :

Saw this again in this test run: https://solutions.qa.canonical.com/testruns/testRun/be5c6d38-d0ac-40c4-89b2-2f730e88567b

The controller bootstraps, we reach it then try to read the status on the new controller and it immediately fails:

2020-11-17 07:55:34.477345 Launching controller instance(s) on openstack_cloud/RegionOne...
2020-11-17 07:56:20.560529 - instance "405d3b2c-3cdf-4613-a935-1f9425b531ef" has status BUILD, wait 10 seconds before retry, attempt 1
 - instance "405d3b2c-3cdf-4613-a935-1f9425b531ef" has status BUILD, wait 10 seconds before retry, attempt 2
 - instance "405d3b2c-3cdf-4613-a935-1f9425b531ef" has status BUILD, wait 10 seconds before retry, attempt 3
 - 405d3b2c-3cdf-4613-a935-1f9425b531ef (arch=amd64 mem=4G cores=2)
2020-11-17 07:56:20.560628 Installing Juju agent on bootstrap instance
2020-11-17 07:56:22.976989 Fetching Juju Dashboard 0.3.0
2020-11-17 07:56:22.977071 Waiting for address
2020-11-17 07:56:24.006900 Attempting to connect to 10.244.33.173:22
2020-11-17 07:56:24.007005 Attempting to connect to 172.16.0.118:22
2020-11-17 07:56:54.240663 Connected to 10.244.33.173
2020-11-17 07:56:54.245596 Running machine configuration script...
2020-11-17 07:58:48.900200 Bootstrap agent now started
2020-11-17 07:58:50.430647 Contacting Juju controller at 10.244.33.173 to verify accessibility...
2020-11-17 07:59:06.499699
2020-11-17 07:59:06.499756 Bootstrap complete, controller "foundation-openstack" is now available
2020-11-17 07:59:06.499775 Controller machines are in the "controller" model
2020-11-17 07:59:06.499803 Initial model "default" added

Artifacts here: https://oil-jenkins.canonical.com/artifacts/be5c6d38-d0ac-40c4-89b2-2f730e88567b/index.html

Revision history for this message
Michael Skalka (mskalka) wrote :

Lost the tail end of the log in that last comment:

2020-11-17 07:59:06.499756 Bootstrap complete, controller "foundation-openstack" is now available
2020-11-17 07:59:06.499775 Controller machines are in the "controller" model
2020-11-17 07:59:06.499803 Initial model "default" added
Traceback (most recent call last):
  File "/usr/local/bin/fce", line 11, in <module>
    load_entry_point('foundationcloudengine', 'console_scripts', 'fce')()
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/main.py", line 170, in entry_point
    sys.exit(main(sys.argv[1:]))
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/main.py", line 161, in main
    opts.func(opts)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/build.py", line 113, in build_main
    layer, not args.skip_validate, args.force_rebuild, args.steps
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/build.py", line 75, in build_and_validate_if_needed
    layer.build_outer(only_steps)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/layers/baselayer.py", line 122, in build_outer
    self.build(only_steps=only_steps)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/layers/jujucontrollerlayer.py", line 362, in build
    self.run_steps(only_steps)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/layers/steppedbaselayer.py", line 114, in run_steps
    step.build()
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/layers/jujucontrollerlayer.py", line 225, in build
    controller_status = status(model)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/juju_cli.py", line 251, in status
    return yaml.safe_load(local(command, stderr=subprocess.DEVNULL))
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/remotehelpers.py", line 209, in local
    return run_cmd(cmd, target_machine, **kwargs)
  File "/home/ubuntu/cpe/foundation/foundationcloudengine/foundationcloudengine/remotehelpers.py", line 147, in run_cmd
    raise subprocess.CalledProcessError(ps.returncode, cmd, output=out, stderr=err)
subprocess.CalledProcessError: Command '['juju', 'status', '-m', 'foundation-openstack:controller', '--format=yaml']' returned non-zero exit status 1.
Build step 'Execute shell' marked build as failure

Revision history for this message
Pen Gale (pengale) wrote :

@mskalka: can you update the script called juju status to capture STDOUT and STDERR? I still can't replicate this on my end, and I wonder if it isn't as simple as juju status returning an error because the controller isn't yet ready to respond.

Revision history for this message
Joshua Genet (genet022) wrote :

Hit this bug and just messing around with it, its very sporadic. I started just spamming the cmd below. It worked once, then failed 6 times, then worked 3 times, then back to failing. All within 5mins.

Here's the output from a failure:
$ juju status -m foundation-openstack:controller --format=yaml
{}
ERROR could not fetch machines: cannot get all instance data for model: read tcp 252.223.0.1:49770->252.92.0.1:37017: i/o timeout

$ juju status -m foundation-openstack:controller
ERROR could not fetch machines: cannot get all instance data for model: read tcp 252.198.0.1:39924->252.92.0.1:37017: i/o timeout

Revision history for this message
Ian Booth (wallyworld) wrote :

Port 371017 is mongo.
Is mongo busy or otherwise unresponsive at the times it fails?

Changed in juju:
milestone: 3.0.0 → 3.0.1
Changed in juju:
milestone: 3.0.1 → 3.0.2
Changed in juju:
milestone: 3.0.2 → 3.0.3
Revision history for this message
Juan M. Tirado (tiradojm) wrote :

This is an old issue. Reopen if needed.

Changed in juju:
status: Triaged → Incomplete
Changed in juju:
milestone: 3.0.3 → 3.0.4
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Canonical Juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.