juju status failure

Bug #1782803 reported by Ashley Lai
Affects           Status        Importance  Assigned to  Milestone
Canonical Juju    Fix Released  High        Tim Penhey
Juju Wait Plugin  Invalid       Undecided   Unassigned

Bug Description

We have seen a few test runs fail on "juju status --format=json" with juju 2.4.1.

foundation/bin/log_wrapper juju wait -m foundations-maas:openstack -t 14400 --workload # up to four hours!
[2018-07-20-01:29:02/deploy]: ERROR:root:ERROR codec.ReadHeader error: error receiving message: read tcp 10.227.81.66:58716->10.244.41.41:17070: read: connection reset by peer

ERROR:root:juju status --format=json failed: 1

Revision history for this message
Ashley Lai (alai) wrote :
tags: added: cdo-qa foundation-engine
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Ashley Lai,

Is this on a fresh bootstrap of 2.4 or a system upgraded to 2.4?

Changed in juju:
status: New → Incomplete
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Also, please attach only the juju logs. Traversing the huge tree you have attached, the only file under /var/log/juju is audit.log, which is not useful for diagnosis.

Thank you for your understanding.

Revision history for this message
John George (jog) wrote :

@Anastasia, these are fresh bootstraps of 2.4

Chris Gregan (cgregan)
Changed in juju:
status: Incomplete → Confirmed
Revision history for this message
Richard Harding (rharding) wrote :

Marking critical as it's become a release blocker. @jog are you able to get @anastasia-macmood the requested logs?

Changed in juju:
importance: Undecided → Critical
Changed in juju:
assignee: nobody → Anastasia (anastasia-macmood)
Revision history for this message
John George (jog) wrote :
Revision history for this message
Anastasia (anastasia-macmood) wrote :

After further investigation and a look at the log output from https://oil-jenkins.canonical.com/job/make_foundation/1580/consoleFull, John George (jog) and I believe that the failure is intermittent and is caused by the juju-wait plugin https://pastebin.canonical.com/p/5pSVcKzZQS/.

This is also evident in the bug description. There is nothing at this stage that points to the failure being in 'juju status --format=json' nor in the juju code itself.

Changed in juju:
status: Confirmed → Incomplete
importance: Critical → High
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Marking this as Incomplete and downgrading to High, as this is an intermittent failure that does not seem to be a "juju" regression. Adding the "juju-wait" project.

Revision history for this message
Richard Harding (rharding) wrote :

@anastasia, I appreciate the investigation and the work with John George, but this bug is identified as a release blocker for 2.4.1, and until we get the ok to release I'm not comfortable marking it as anything other than Critical. Did we get any such ok for the release?

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Richard Harding (rharding),

At this stage, there is nothing that can be done on the juju side. In fact, nothing we have so far points at "juju". We can keep it as Critical for juju and block the release on it, but the Juju team cannot fix this.

Considering the intermittency, the fact that the log points to a plugin external to juju, that the error points to a potential timing/connectivity issue, and that there are other bugs coming from the same environment (machines, scripts, etc., like bug #1783193), I'd prefer to see this report bumped up to Critical only when there is a concrete regression in juju.

Changed in juju:
importance: High → Critical
Revision history for this message
Stuart Bishop (stub) wrote :

ERROR:root:ERROR codec.ReadHeader error: error receiving message: read tcp 10.227.81.66:58716->10.244.41.41:17070: read: connection reset by peer

ERROR:root:juju status --format=json failed: 1

The juju-wait plugin tried to run the command 'juju status --format=json'. The juju command failed, emitting the error string 'ERROR codec.ReadHeader error: error receiving message: read tcp 10.227.81.66:58716->10.244.41.41:17070: read: connection reset by peer'. I expect this means exactly what it says, and that the controller terminated the connection or there was a netsplit.

It would be possible for the juju-wait plugin to retry on failure, but I personally believe that sane retry behavior belongs in the 'juju' tool (and is probably already in there, since it is required for proper HA controller support).
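
As a rough sketch (not juju-wait's actual code; the helper name, attempt count and delay below are made up for illustration), a wrapper could retry the status call on transient failures like this:

    import json
    import subprocess
    import time

    def status_with_retries(model, attempts=3, delay=5):
        # Hypothetical helper: retry 'juju status --format=json' a few times
        # before treating a transient error (e.g. connection reset) as fatal.
        for attempt in range(1, attempts + 1):
            proc = subprocess.run(
                ["juju", "status", "-m", model, "--format=json"],
                capture_output=True, text=True)
            if proc.returncode == 0:
                return json.loads(proc.stdout)
            if attempt < attempts:
                time.sleep(delay)
        raise RuntimeError("juju status --format=json failed: %d" % proc.returncode)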

Setting this to invalid on juju-wait, as it is correctly exposing the juju error to the end user.

Changed in juju-wait:
status: New → Invalid
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Stuart Bishop (stub),

Thanks for looking!

We will get the logs from the next Solutions QA runs on these machines to see what is going on on the juju side.

Tabular output works shortly after this failure, so whatever connection issue was observed does, indeed, look rectified.

I am not clear what retry behavior you refer to in comment #11, though. Is it a re-run of 'juju-wait' or a re-run of 'juju status --format=json'? If you were referring to re-establishing the connection, then yes, that already exists. Consequently, the error reported may be due to a particular setup, a network issue (unlikely, since most CI runs work as expected... only about 1 in every 10 exhibits this failure), or simply the timing of when juju-wait is run (for example, when the connection to the controller is lost/reset)...

Tim Penhey (thumper)
tags: added: cdo-qa-blocker
Revision history for this message
Tim Penhey (thumper) wrote :

The only reason we'd see this particular error is if something is happening to the API servers.

For that we need debug logs from the controller machines. Other 2.4.1 deploys seem to be unaffected, so we need to determine what is different with the solutions QA machines.

Revision history for this message
Chris Gregan (cgregan) wrote :
Revision history for this message
Tim Penhey (thumper) wrote :

@Chris, the crashdump above is missing the vital files.

I would have expected to see the following:

/0/baremetal/var/log/juju/machine-0.log
/1/baremetal/var/log/juju/machine-1.log
/2/baremetal/var/log/juju/machine-2.log

Changed in juju:
assignee: Anastasia (anastasia-macmood) → nobody
Revision history for this message
John George (jog) wrote :

The QA scripts have been updated to capture machine logs from the controller, but we have not reproduced this issue since that change. July 22nd was the most recent occurrence.

Revision history for this message
Ashley Lai (alai) wrote :

We hit this issue again today. Please find the controller crashdump attached.

Changed in juju:
status: Incomplete → New
Revision history for this message
Tim Penhey (thumper) wrote :

A look through the log files showed absolutely nothing out of the ordinary. When was the error from?

I also noticed that juju status appears to be called twice every five seconds, the two calls almost immediately after each other. Is this expected?

Changed in juju:
status: New → Incomplete
Revision history for this message
Ashley Lai (alai) wrote :

@Tim - we call juju status once after the failure. The repeated "juju status" calls may come from "juju wait". Here is the time when it failed in the crashdump.

[2018-08-20-01:25:52/deploy]: ERROR:root:ERROR codec.ReadHeader error: error receiving message: read tcp 10.227.81.116:38716->10.244.40.224:17070: read: connection reset by peer

Changed in juju:
status: Incomplete → New
Revision history for this message
Tim Penhey (thumper) wrote :

The time you mention there isn't in any of the machine logs. The machine log timestamps should be written in UTC, and range from
  2018-08-15 02:17:25
to
  2018-08-15 02:45:15

All requests for FullStatus are matched up with a response, so really I have no idea what is going on here.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Jenkins is a bit misleading on how it reports timestamps sometimes. Here is another reproduction. You can see it in the logs for sure:

machine-2: 21:59:22 DEBUG juju.rpc error closing codec: write tcp 10.244.41.40:17070->91.189.92.9:58494: write: broken pipe
machine-2: 21:59:22 ERROR juju.apiserver error serving RPCs: codec.ReadHeader error: error receiving message: read tcp 10.244.41.40:17070->91.189.92.9:58494: read: connection reset by peer

Here's how it showed up on the client side:
ERROR:root:ERROR codec.ReadHeader error: error receiving message: read tcp 10.227.81.236:58494->10.244.41.40:17070: read: connection reset by peer

ERROR:root:juju status --format=json failed: 1

The difference in IP is because we're traversing a NAT'ing router.

Revision history for this message
Tim Penhey (thumper) wrote :

From that comment, Jason, it looks very much like Juju is not at fault. The controller is saying that the client disconnected, and the client is saying the server disconnected.

It looks very much like the thing that is sitting between the client and the controller is at fault here.

Changed in juju:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Ok - here's the deal: we need a way to monitor the completion of juju deployments. A transient network blip (let's assume that's the underlying cause here) shouldn't cause a 'deployment failed' signal.

If we can't connect to the juju server for some time, ok, that's reasonable, but here it looks like a single dropped connection is causing the failure. Perhaps a quick retry afterward would be enough to keep this from surfacing.

stub says he thinks that retrying should be in the juju client and that seems reasonable to me - that keeps everyone using the client from having to implement that behavior.

Is that retrying there today? If not, can it be added?

Changed in juju:
status: Incomplete → New
Revision history for this message
Tim Penhey (thumper) wrote :

That retry isn't there today, nor should it be at the API level.

We will however add a retry inside the status command to help with this.

Changed in juju:
status: New → Triaged
milestone: none → 2.4.4
assignee: nobody → Tim Penhey (thumper)
Revision history for this message
Tim Penhey (thumper) wrote :
Changed in juju:
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
summary: - juju 2.4.1: juju status failure
+ juju status failure
Changed in juju:
status: Fix Released → Triaged
importance: Critical → High
milestone: 2.4.4 → 2.5-rc1
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We saw this on a 2.5 beta 3 run:

11:54:16 INFO cmd bundle.go:426 Deploy of bundle completed.
11:54:16 DEBUG juju.api monitor.go:35 RPC connection died
11:54:16 INFO cmd supercommand.go:465 command finished
ERROR:root:WARNING closing api sessions failed closing statusAPI: codec.ReadHeader error: error receiving message: read tcp 10.227.81.56:60356->10.244.41.23:17070: read: connection reset by peer
ERROR connection is shut down

ERROR:root:juju status --format=json failed: 1

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Tim Penhey (thumper) wrote :

This isn't the same bug as before; 2.5 does have the retries as well.

This has to do with the multiwatcher used by the bundle deployer hitting an error.
Can we have a different bug for this please?

Also, can we run the controllers with at least INFO logging? Those logs aren't really useful.

Revision history for this message
Tim Penhey (thumper) wrote :

At 11:54 the mongo replicaset was doing things. Is it possible that the deploy is being attempted before the controllers are all talking to each other?

Changed in juju:
status: Triaged → Invalid
milestone: 2.5-rc1 → none
Revision history for this message
Tim Penhey (thumper) wrote :

Ah... perhaps the multiwatcher is not what we care about.

It is hard to tell from the output above if the first ERROR is from the status call or the deploy call.

There are args added to status now for the number of retries and the retry delay; perhaps your script should utilise those?
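
For example, something like the following (the flag names and values here are from memory; check 'juju status --help' on your version for the exact spelling and defaults):

    juju status -m foundations-maas:openstack --format=json --retry-count 5 --retry-delay 5s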

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Tim, the error above is from juju status.

We are only seeing this issue on juju 2.5 beta 3. We could add retries, but if this were an underlying network issue I would expect to see these failures on other versions of Juju also. It seems like something may be wrong with juju 2.5 beta 3.

Changed in juju:
status: Invalid → New
Revision history for this message
Richard Harding (rharding) wrote :

Jason, can you confirm you're still hitting this and, per #28, see if we can file a clean bug to diagnose it, as it does look like something different is going on at this point?

Thanks!

Changed in juju:
status: New → Fix Released