juju status failure

Bug #1782803 reported by Ashley Lai
Affects           Status        Importance  Assigned to  Milestone
Canonical Juju    Fix Released  High        Tim Penhey
Juju Wait Plugin  Invalid       Undecided   Unassigned

Bug Description

We have seen a few test runs fail on "juju status --format=json" with juju 2.4.1.

foundation/bin/log_wrapper juju wait -m foundations-maas:openstack -t 14400 --workload # up to four hours!
[2018-07-20-01:29:02/deploy]: ERROR:root:ERROR codec.ReadHeader error: error receiving message: read tcp 10.227.81.66:58716->10.244.41.41:17070: read: connection reset by peer

ERROR:root:juju status --format=json failed: 1

Revision history for this message
Ashley Lai (alai) wrote :
tags: added: cdo-qa foundation-engine
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Ashley Lai,

Is this on a fresh bootstrap of 2.4 or a system upgraded to 2.4?

Changed in juju:
status: New → Incomplete
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Also, please attach only the juju logs. Traversing the huge tree you have attached, the only file under /var/log/juju is audit.log, which is not useful for diagnosis.

Thank you for your understanding.

Revision history for this message
John George (jog) wrote :

@Anastasia, these are fresh bootstraps of 2.4

Chris Gregan (cgregan)
Changed in juju:
status: Incomplete → Confirmed
Revision history for this message
Richard Harding (rharding) wrote :

Marking critical as it's become a release blocker. @jog are you able to get @anastasia-macmood the requested logs?

Changed in juju:
importance: Undecided → Critical
Changed in juju:
assignee: nobody → Anastasia (anastasia-macmood)
Revision history for this message
John George (jog) wrote :
Revision history for this message
Anastasia (anastasia-macmood) wrote :

After further investigation and a look at the log output from https://oil-jenkins.canonical.com/job/make_foundation/1580/consoleFull, John George (jog) and I believe that the failure is intermittent and is caused by the juju-wait plugin https://pastebin.canonical.com/p/5pSVcKzZQS/.

This is also evident in the bug description. There is nothing at this stage that points to the failure being in 'juju status --format=json' nor in the juju code itself.

Changed in juju:
status: Confirmed → Incomplete
importance: Critical → High
Revision history for this message
Anastasia (anastasia-macmood) wrote :

Marking this as Incomplete and downgrading to High, as this is an intermittent failure that does not seem to be a "juju" regression. Adding the "juju-wait" project.

Revision history for this message
Richard Harding (rharding) wrote :

@anastasia, I appreciate the investigation and the work with John George, but this bug is identified as a release blocker for 2.4.1, and until we get the ok to release I'm not comfortable marking it as anything other than Critical. Did we get any such ok for the release?

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Richard Harding (rharding),

At this stage, there is nothing that can be done on the juju side. In fact, nothing we have so far points at "juju". We can keep it as Critical for juju and block the release on it, but the Juju team cannot fix this.

Considering the intermittency, the fact that the log points to a plugin external to juju, that the error points to a potential timing/connectivity issue, and that there are other bugs coming from the same environment (machines, scripts, etc., like bug #1783193), I'd prefer to see this report bumped up to Critical only when there is a concrete regression in juju.

Changed in juju:
importance: High → Critical
Revision history for this message
Stuart Bishop (stub) wrote :

ERROR:root:ERROR codec.ReadHeader error: error receiving message: read tcp 10.227.81.66:58716->10.244.41.41:17070: read: connection reset by peer

ERROR:root:juju status --format=json failed: 1

The juju-wait plugin tried to run the command 'juju status --format=json'. The juju command failed, emitting the error string 'ERROR codec.ReadHeader error: error receiving message: read tcp 10.227.81.66:58716->10.244.41.41:17070: read: connection reset by peer'. I expect this means exactly what it says, and that the controller terminated the connection or there was a netsplit.

It would be possible for the juju-wait plugin to retry on failure, but I personally believe that sane retry behavior belongs in the 'juju' tool (and is probably already in there, since it is required for proper HA controller support).
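
As a rough sketch (not juju-wait's actual code; the helper name, attempt count and delay below are made up for illustration), a wrapper could retry the status call on transient failures like this:

    import json
    import subprocess
    import time

    def status_with_retries(model, attempts=3, delay=5):
        # Hypothetical helper: retry 'juju status --format=json' a few times
        # before treating a transient error (e.g. connection reset) as fatal.
        for attempt in range(1, attempts + 1):
            proc = subprocess.run(
                ["juju", "status", "-m", model, "--format=json"],
                capture_output=True, text=True)
            if proc.returncode == 0:
                return json.loads(proc.stdout)
            if attempt < attempts:
                time.sleep(delay)
        raise RuntimeError("juju status --format=json failed: %d" % proc.returncode)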

Setting this to invalid on juju-wait, as it is correctly exposing the juju error to the end user.

Changed in juju-wait:
status: New → Invalid
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Stuart Bishop (stub),

Thanks for looking!

We will get the logs from the next Solutions QA runs on these machines to see what is going on on the juju side.

Tabular output works shortly after this failure, so whatever connection issue was observed does, indeed, look rectified.

I am not clear what retry behavior you refer to in comment #11, though. Is it a re-run of 'juju-wait' or a re-run of 'juju status --format=json'? If you were referring to re-establishing the connection, then yes, that already exists. Consequently, the error reported may be due to a particular setup, a network issue (unlikely, since most CI runs work as expected... only about 1 in every 10 exhibits this failure), or simply the timing of when juju-wait is run (for example, when the connection to the controller is lost/reset)...

Tim Penhey (thumper)
tags: added: cdo-qa-blocker
Revision history for this message
Tim Penhey (thumper) wrote :

The only reason we'd see this particular error is if something is happening to the API servers.

For that we need debug logs from the controller machines. Other 2.4.1 deploys seem to be unaffected, so we need to determine what is different with the solutions QA machines.

Revision history for this message
Chris Gregan (cgregan) wrote :
Revision history for this message
Tim Penhey (thumper) wrote :

@Chris, the crashdump above is missing the vital files.

I would have expected to see the following:

/0/baremetal/var/log/juju/machine-0.log
/1/baremetal/var/log/juju/machine-1.log
/2/baremetal/var/log/juju/machine-2.log

Changed in juju:
assignee: Anastasia (anastasia-macmood) → nobody
Revision history for this message
John George (jog) wrote :

The QA scripts have been updated to capture machine logs from the controller, but we have not reproduced this issue since that change. July 22nd was the most recent occurrence.

Revision history for this message
Ashley Lai (alai) wrote :

We hit this issue again today. Please find the controller crashdump attached.

Changed in juju:
status: Incomplete → New
Revision history for this message
Tim Penhey (thumper) wrote :

A look through the log files showed absolutely nothing out of the ordinary. When was the error from?

I also noticed that juju status appears to be called twice every five seconds, the two calls almost immediately after each other. Is this expected?

Changed in juju:
status: New → Incomplete
Revision history for this message
Ashley Lai (alai) wrote :

@Tim - we call juju status once after the failure. The repeated "juju status" calls may come from "juju wait". Here is the time when it failed in the crashdump.

[2018-08-20-01:25:52/deploy]: ERROR:root:ERROR codec.ReadHeader error: error receiving message: read tcp 10.227.81.116:38716->10.244.40.224:17070: read: connection reset by peer

Changed in juju:
status: Incomplete → New
Revision history for this message
Tim Penhey (thumper) wrote :

The time you mention there isn't in any of the machine logs. The machine log timestamps should be written in UTC, and range from
  2018-08-15 02:17:25
to
  2018-08-15 02:45:15

All requests for FullStatus are matched up with a response, so really I have no idea what is going on here.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Jenkins is a bit misleading on how it reports timestamps sometimes. Here is another reproduction. You can see it in the logs for sure:

machine-2: 21:59:22 DEBUG juju.rpc error closing codec: write tcp 10.244.41.40:17070->91.189.92.9:58494: write: broken pipe
machine-2: 21:59:22 ERROR juju.apiserver error serving RPCs: codec.ReadHeader error: error receiving message: read tcp 10.244.41.40:17070->91.189.92.9:58494: read: connection reset by peer

Here's how it showed up on the client side:
ERROR:root:ERROR codec.ReadHeader error: error receiving message: read tcp 10.227.81.236:58494->10.244.41.40:17070: read: connection reset by peer

ERROR:root:juju status --format=json failed: 1

The difference in IP is because we're traversing a NAT'ing router.

Revision history for this message
Tim Penhey (thumper) wrote :

From that comment, Jason, it looks very much like Juju is not at fault. The controller is saying that the client disconnected, and the client is saying the server disconnected.

It looks very much like the thing that is sitting between the client and the controller is at fault here.

Changed in juju:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Ok - here's the deal: we need a way to monitor the completion of juju deployments. A transient network blip (let's assume that's the underlying cause here) shouldn't cause a 'deployment failed' signal.

If we can't connect to the juju server for some time, ok, that's reasonable, but here it looks like a single dropped connection is causing the failure. Perhaps a quick retry afterward would be enough to keep this from surfacing.

stub says he thinks that retrying should be in the juju client and that seems reasonable to me - that keeps everyone using the client from having to implement that behavior.

Is that retrying there today? If not, can it be added?

Changed in juju:
status: Incomplete → New
Revision history for this message
Tim Penhey (thumper) wrote :

That retry isn't there today, nor should it be at the API level.

We will however add a retry inside the status command to help with this.

Changed in juju:
status: New → Triaged
milestone: none → 2.4.4
assignee: nobody → Tim Penhey (thumper)
Revision history for this message
Tim Penhey (thumper) wrote :
Changed in juju:
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
summary: - juju 2.4.1: juju status failure
+ juju status failure
Changed in juju:
status: Fix Released → Triaged
importance: Critical → High
milestone: 2.4.4 → 2.5-rc1
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We saw this on a 2.5 beta 3 run:

11:54:16 INFO cmd bundle.go:426 Deploy of bundle completed.
11:54:16 DEBUG juju.api monitor.go:35 RPC connection died
11:54:16 INFO cmd supercommand.go:465 command finished
ERROR:root:WARNING closing api sessions failed closing statusAPI: codec.ReadHeader error: error receiving message: read tcp 10.227.81.56:60356->10.244.41.23:17070: read: connection reset by peer
ERROR connection is shut down

ERROR:root:juju status --format=json failed: 1

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Tim Penhey (thumper) wrote :

This isn't the same bug as before; 2.5 does have the retries as well.

This has to do with the multiwatcher used by the bundle deployer hitting an error.
Can we have a different bug for this please?

Also, can we run the controllers with at least INFO logging? Those logs aren't really useful.

Revision history for this message
Tim Penhey (thumper) wrote :

At 11:54 the mongo replicaset was doing things. Is it possible that the deploy is being attempted before the controllers are all talking to each other?

Changed in juju:
status: Triaged → Invalid
milestone: 2.5-rc1 → none
Revision history for this message
Tim Penhey (thumper) wrote :

Ah... perhaps the multiwatcher is not what we care about.

It is hard to tell from the output above if the first ERROR is from the status call or the deploy call.

There are args added to status now for the number of retries and the retry delay; perhaps your script should utilise those?
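
For example, something like the following (the flag names and values here are from memory; check 'juju status --help' on your version for the exact spelling and defaults):

    juju status -m foundations-maas:openstack --format=json --retry-count 5 --retry-delay 5s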

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Tim, the error above is from juju status.

We are only seeing this issue on juju 2.5 beta 3. We could add retries, but if this were an underlying network issue I would expect to see these failures on other versions of Juju also. It seems like something may be wrong with juju 2.5 beta 3.

Changed in juju:
status: Invalid → New
Revision history for this message
Richard Harding (rharding) wrote :

Jason, can you confirm you're still hitting this and, per #28, see if we can file a clean bug to diagnose it, as it does look like something different is going on at this point?

Thanks!

Changed in juju:
status: New → Fix Released