ERROR cannot deploy bundle: cannot deploy application: i/o timeout

Bug #1597601 reported by Jason Hobbs
This bug affects 3 people
Affects         Status        Importance  Assigned to         Milestone
Canonical Juju  Fix Released  High        Menno Finlay-Smits
juju-core       Fix Released  High        Menno Finlay-Smits

Bug Description

Running this command:
juju-2.0 deploy -m maas:default artifacts/oil_deployment

I got this output:
added charm cs:trusty/cinder-253
ERROR cannot deploy bundle: cannot deploy application "cinder": cannot add application "cinder": read tcp 127.0.0.1:54084->127.0.0.1:37017: i/o timeout

I've attached the machine-0 log.

From this build in OIL: http://10.245.162.43:8080/job/pipeline_deploy/6818/

This is with 2.0-beta10 and 2.0-beta11.

We hit this a couple of times a day in OIL.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
description: updated
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Jason, can you also attach /var/log/syslog from the controller machine?

Changed in juju-core:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I recreated this, and have attached /var/log from the controller machine.

Changed in juju-core:
status: Incomplete → New
description: updated
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Jason - would you be able to provide the bundle being deployed to see if we can recreate?

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.0-beta13
tags: added: bundles
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Cheryl,

Here is the bundle that was used in the deployment I captured the bootstrap node logs from.

http://paste.ubuntu.com/19286457/

It doesn't happen every time I deploy with this bundle - it's a sporadic failure, so you may not be able to reproduce it easily.

tags: added: oil-2.0
tags: added: 2.0
Curtis Hovey (sinzui)
tags: added: deploy repeatability
tags: added: retry
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta13 → 2.0-beta14
Changed in juju-core:
assignee: nobody → Menno Smits (menno.smits)
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I'm pretty sure what's happening is that "juju enable-ha" is being run at some point and then the deploy continues. As each new controller node created by enable-ha comes up and joins Juju's MongoDB replicaset, a MongoDB leader election is triggered, which causes each mongod to drop its connections (this is by design).

The dropped MongoDB connections cause the controller agent's workers to restart and reconnect, as they should, but if an API request comes in at just the right time it may be actioned just as the MongoDB connection goes down. This results in the i/o timeout error.

MongoDB's logs are in /var/log/syslog. Here's what mongod on machine-0 logged as a new replicaset member joined:

Jul 13 09:50:59 ubuntu mongod.37017[4593]: [ReplicationExecutor] transition to RECOVERING
Jul 13 09:50:59 ubuntu mongod.37017[4593]: [ReplicationExecutor] transition to SECONDARY
Jul 13 09:50:59 ubuntu mongod.37017[4593]: [ReplicationExecutor] conducting a dry run election to see if we could be elected
Jul 13 09:50:59 ubuntu mongod.37017[4593]: [ReplicationExecutor] dry election run succeeded, running for election
Jul 13 09:50:59 ubuntu mongod.37017[4593]: [ReplicationExecutor] election succeeded, assuming primary role in term 1
Jul 13 09:50:59 ubuntu mongod.37017[4593]: [ReplicationExecutor] transition to PRIMARY

The timestamps correspond to when connections and workers are shut down in the machine agent, and when the i/o timeout error occurred.

The immediate workaround is to wait for the new controller machines to come up after "juju enable-ha" is issued before continuing with the deployment (being aware that the i/o timeout might occur while status is polled).

Ideally, Juju would hide all this from the user but I'm not sure what the right approach should be. The fix is likely to be non-trivial and is unlikely to happen before 2.0. This problem isn't new. I'll take this to the juju-dev list for comment.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1597601] Re: ERROR cannot deploy bundle: cannot deploy application: i/o timeout

In OIL, we're never running enable-ha (unless this happens implicitly by doing a bootstrap/deploy) and we hit this regularly.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

OK, that's good to know. It looks like MongoDB will do these leader elections even when there's only one member. That rings a bell now that I think about it. It happens soon after MongoDB has come up.

I'm bringing this topic up with the Juju tech board tonight.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

@jason-hobbs: It's looking like it's not going to be feasible to prevent an error from bubbling up to the client when MongoDB performs a leader election. What we can do is translate that error into a clear, easily identifiable error which indicates that the request should be retried. Would this be acceptable to you?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Menno,

Why can't juju retry if it's retryable? This sounds like a condition internal to juju/mongodb, not something external about the environment I need to fix as an operator. I don't want to have to write a wrapper around juju's cli to do what juju should be doing itself, which is deploying what I asked for unless there is a fatal error caused by something in the environment that is beyond juju's control.

Jason

Revision history for this message
Richard Harding (rharding) wrote :

Menno, I have to agree with Jason. Can you clarify 'not feasible'? What I could see is that every write to mongo would have to be wrapped with a retry mechanism in order to make this completely invisible. That's an insane amount of work; however, it seems we could target specific operations that users perform and expect to work, such as deploying a bundle and the operations contained in it, or adding an application (where this appears to have failed), and at least start to set an example in the code.

I'd love to get more details as to the tech board's thoughts on how to prevent this longer term. If I'm mistaken, and the size of the exposed surface of the problem isn't what makes it not feasible, I'd be curious what that is and if there are other ways we can make a dent in the problem.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Sorry, I wasn't clear. The operator wouldn't have to retry - the API client (in this case the juju CLI) would though. There would be no pain from the operator's perspective as this would be handled transparently but it would mean that every API client would need to be prepared to handle the "please retry" error.

@jason-hobbs: My question was more, "would it be acceptable for the API server to return a 'please retry' error?" but if you're not consuming the API directly then I guess it doesn't really matter to you. It will matter to the juju GUI, landscape teams etc. I'll take this to the juju-dev list.

Revision history for this message
Richard Harding (rharding) wrote :

Thanks for the clarification Menno. It still seems that there's some code in Juju that is talking to mongo as a remote data service that could perform a retry much as we would against a cloud api. If we cannot though, the next best thing is the client retrying for the user in a controlled fashion.

If/when we head down that path we need to establish a pattern that makes it clear where clients are expected to handle a retry, much like noting that a call can throw an error.

Changed in juju-core:
assignee: Menno Smits (menno.smits) → nobody
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Rick: we do indeed talk to MongoDB over an API, but the machine agent's workers (including the API server) are all arranged in such a way that they are stopped when the connection to MongoDB dies. This provides a robust way for workers to reconfigure themselves when the MongoDB master changes, and also a way to "start from scratch" when MongoDB has had a problem.

Unfortunately, this behaviour can also bubble up to the client.

The Juju controllers used to use MongoDB's idea of a master to drive where certain controller-based workers ran. In this case it was particularly useful to have all workers which ultimately rely on a connection to MongoDB be restarted when the connection dropped due to a leader election. We're moving away from using the MongoDB master for this now, but I believe there are still one or two places where it matters.

I think that Will might have some useful input on all this so I'm going to poke him about it.

Changed in juju-core:
assignee: nobody → Alexis Bruemmer (alexis-bruemmer)
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I need to refocus on other things, so I'm summarising what I think needs to be done at a high level if we go with the "please retry" error approach (a rough sketch follows the list).

1. A new API error type needs to be implemented which indicates to the client that the last request should be retried.
2. The API server needs to translate MongoDB "i/o timeout" errors into this new retry error.
3. The client needs to be updated (hopefully in just one place) to transparently retry when the retry error is seen.
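
As an illustration only, here is a minimal Go sketch of the three steps above; the names (ErrTryAgain, IsTryAgain, callWithRetry), the attempt count and the backoff are invented for this sketch and are not Juju's actual error types or client code.

package retry

import (
	"errors"
	"fmt"
	"time"
)

// ErrTryAgain is a hypothetical error the API server would return when a
// request was interrupted by a transient problem (such as the MongoDB
// connection dropping) and should simply be retried by the client.
var ErrTryAgain = errors.New("temporary server problem, please retry")

// IsTryAgain reports whether err indicates the request should be retried.
func IsTryAgain(err error) bool {
	return errors.Is(err, ErrTryAgain)
}

// callWithRetry transparently retries an API call when the server asks for
// a retry, so the operator never sees the error.
func callWithRetry(call func() error) error {
	const attempts = 5
	var err error
	for i := 0; i < attempts; i++ {
		if err = call(); err == nil || !IsTryAgain(err) {
			return err
		}
		time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

The idea is that an API client such as the juju CLI would wrap each request in something like callWithRetry, so the "please retry" error never reaches the operator.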

I've assigned to Alexis so this ticket doesn't get forgotten.

Tim Penhey (thumper)
Changed in juju-core:
assignee: Alexis Bruemmer (alexis-bruemmer) → Tim Penhey (thumper)
status: Triaged → In Progress
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

This issue was discussed further at last night's tech board meeting. It was decided that returning a "please retry" error for the client to handle is a good immediate measure. It makes the situation better than it is now and the concept of such an error is useful to have available.

It was also decided that we should look to rearrange the API server so it doesn't have to restart when the connection to MongoDB dies. This will be a significant effort which we will track as a tech debt item (adding to the board now).

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta14 → 2.0-beta15
Revision history for this message
Tim Penhey (thumper) wrote :

Although I wasn't able to reproduce this problem, we think it is now fixed.

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We're hitting this in beta 15.

Changed in juju-core:
status: Fix Released → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here's the run in OIL we hit this on (internal IP): http://10.245.162.43:8080/job/pipeline_deploy/9178/console

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Logs from the bootstrap node from the failure on August 22nd.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Logs from another run (9153) on the 22nd that hit this bug.

Changed in juju-core:
status: New → Triaged
milestone: 2.0-beta15 → 2.0-beta16
Revision history for this message
Tim Penhey (thumper) wrote :

@Jason, is there a way we can get debug logs? The default output that we get in the machine-0.log is not good enough to help debug this.

Are you able to set the logging config on the controller?

Preferably 'juju=DEBUG'. You can get this by bootstrapping using the --debug option, or later by calling 'juju set-model-config logging-config=juju=DEBUG'

affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta16 → none
milestone: none → 2.0-beta16
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Yes Tim, we'll enable debug logging.

Jason

Revision history for this message
Larry Michel (lmic) wrote :

Here's from latest recreate with debug enabled.

Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta16 → 2.0-beta17
Revision history for this message
Tim Penhey (thumper) wrote :

OK, I think I have it this time. Proposing a fix.

Changed in juju:
status: Triaged → In Progress
Tim Penhey (thumper)
Changed in juju:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Larry Michel (lmic) wrote :

We were able to recreate with Beta18.

2016-09-15 06:28:01,642 [INFO] oil_ci.juju.juju2: deploying: juju-2.0 deploy -m ci-oil-slave9:default artifacts/oil_deployment
Deploying charm "cs:trusty/ceph-mon-3"
ERROR cannot deploy bundle: cannot deploy application "ceph-mon": cannot add application "ceph-mon": read tcp 10.244.241.42:41782->10.244.241.42:37017: i/o timeout

Log from bootstrap node attached.

Changed in juju:
status: Fix Released → New
Tim Penhey (thumper)
Changed in juju:
milestone: 2.0-beta17 → 2.0-rc1
Changed in juju:
assignee: Tim Penhey (thumper) → Menno Smits (menno.smits)
Changed in juju:
status: New → Triaged
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I can see what the problem is. The retry logic only kicks in if the entire API request returns a retry error, but since almost all of Juju's APIs accept multiple entities, it's possible for some parts of the API request to succeed while others fail. A failure for the work done in an API handler for a given entity is much more likely.

I have a fix in mind, but it is non-trivial and has a risk of introducing more problems. I've been working on that today.
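
To make the per-entity wrinkle concrete, here is a hedged Go sketch; the Result type and retryFailed helper are invented for illustration and do not match Juju's real bulk-API params structs. The point is that the client re-submits only the entities whose individual results carried a retry error, rather than retrying the whole request.

package bulk

import "errors"

// errTryAgain stands in for a hypothetical "please retry" error from the
// API server; the real error would live in the API layer.
var errTryAgain = errors.New("temporary server problem, please retry")

// Result is an illustrative stand-in for one entry of a bulk API response:
// one outcome per entity submitted in the request.
type Result struct {
	Entity string
	Err    error // nil on success
}

// retryFailed re-submits only the entities whose individual results asked
// for a retry, leaving entities that already succeeded (or failed
// permanently) alone.
func retryFailed(call func([]string) []Result, entities []string, maxAttempts int) map[string]Result {
	final := make(map[string]Result, len(entities))
	pending := entities
	for attempt := 0; attempt < maxAttempts && len(pending) > 0; attempt++ {
		var next []string
		for _, r := range call(pending) {
			if errors.Is(r.Err, errTryAgain) {
				next = append(next, r.Entity) // retry just this entity
			} else {
				final[r.Entity] = r // success or a permanent error
			}
		}
		pending = next
	}
	for _, e := range pending { // retries exhausted
		final[e] = Result{Entity: e, Err: errTryAgain}
	}
	return final
}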

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I've been unable to reproduce the problem, even when using the same bundle you are. The best I've been able to do is artificially simulate the issue by modifying Juju to randomly generate the appropriate error. This isn't quite the same though, which makes it hard to know whether any fix I come up with will work.

Jason and Larry: If I provide you with a custom Juju build, would it be possible for you to run your tests with it? I'd be a lot happier merging the fixes I'm thinking about knowing that they've worked for you without introducing other problems.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Menno - I can test with another build given instructions to do so - let me know!

Jason

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I can now reproduce the problem more accurately, if still somewhat artificially.

In the latest log tarball I noticed that the mongodb logs in /var/log/syslog stopped for the period between 06:28:15 and 06:28:40. It's during this period that the machine-0 jujud logged tons of i/o timeout errors. It's almost like mongodb got stuck for a while or was heavily loaded for some reason (although I see no reason why this should be so).

This gave me the idea of pausing the mongod process using SIGSTOP. By doing this I'm able to cause jujud to experience the same i/o timeout errors. By pausing the mongod process at the right place inside one of Juju's API handlers I'm able to get the i/o timeout errors at the client as reported. I suspect that for some reason mongod is occasionally getting stuck/busy for a while and this is what's causing the i/o timeout errors in OIL. MongoDB leader elections (which typically happen soon after bootstrapping the controller) are certainly another cause, but that hasn't happened in the latest logs attached to this ticket.
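
For anyone wanting to repeat that reproduction, here is a small stand-alone Go helper along the lines described; the -pid and -for flags and the 45s default are arbitrary choices for this sketch, not anything Juju ships. Run it as root against the mongod PID.

// pause_proc.go: pause a process with SIGSTOP for a while, then resume it
// with SIGCONT, to simulate a hung mongod.
package main

import (
	"flag"
	"log"
	"syscall"
	"time"
)

func main() {
	pid := flag.Int("pid", 0, "PID of the process to pause (e.g. mongod)")
	pause := flag.Duration("for", 45*time.Second, "how long to keep it paused")
	flag.Parse()
	if *pid <= 0 {
		log.Fatal("a valid -pid is required")
	}

	log.Printf("sending SIGSTOP to %d", *pid)
	if err := syscall.Kill(*pid, syscall.SIGSTOP); err != nil {
		log.Fatalf("SIGSTOP failed: %v", err)
	}

	time.Sleep(*pause)

	log.Printf("sending SIGCONT to %d", *pid)
	if err := syscall.Kill(*pid, syscall.SIGCONT); err != nil {
		log.Fatalf("SIGCONT failed: %v", err)
	}
}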

Being able to do this will speed up the development of a fix/workaround. I'd still like it to be tested within the OIL setup once I have it ready though.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Hey Menno,

Sorry, I was confused about which bug this was - we won't be able to take non-released builds for this. We hit this running in our general pipeline - it takes many deployments to ever reproduce it - and we only run released builds in that.

Jason

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

OK, well we'll have to rely on my simulated reproduction then.

Can you tell me a little more about the context where you see the problem? Specifically:

1. Which substrate(s) do you see the problem on? (i.e. MAAS, LXD, public clouds, all of the above)

2. How many of these tests do you run a day and how many of those see this problem?

3. Do you happen to have CPU, memory and/or disk utilisation data for the Juju controller host(s) when the problem happens? Is this something you could collect?

Changed in juju:
milestone: 2.0-rc1 → 2.0-rc2
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

After battling against this for over 2 days straight without arriving at a workable fix, I've now gone for a more pragmatic approach: https://github.com/juju/juju/pull/6292

I've been able to show that this allows jujud to be resilient to mongodb hanging for up to 60s. The logs provided showed mongodb hanging for 30-40s.
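
As a rough illustration of that kind of resilience (not necessarily what the PR does; the sketch assumes the gopkg.in/mgo.v2 driver Juju used at the time and is only safe for idempotent operations), an operation could be retried against a refreshed session until a deadline instead of letting the first socket timeout propagate:

package resilient

import (
	"time"

	mgo "gopkg.in/mgo.v2"
)

// runWithDeadline keeps retrying op against the session until it succeeds
// or the deadline passes, refreshing the session after each failure so a
// dead socket (for example from a mongod that was hung and then recovered)
// is replaced rather than bubbling an i/o timeout up to the API client.
// Only use this for idempotent operations.
func runWithDeadline(session *mgo.Session, deadline time.Duration, op func(*mgo.Session) error) error {
	start := time.Now()
	for {
		err := op(session)
		if err == nil {
			return nil
		}
		if time.Since(start) > deadline {
			return err // mongod has been unresponsive for too long; give up
		}
		session.Refresh() // drop broken sockets and grab fresh ones
		time.Sleep(2 * time.Second)
	}
}

With a sketch like this, a mongod hang shorter than the deadline shows up as a slow request rather than an i/o timeout at the client.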

Changed in juju:
status: In Progress → Fix Committed
Changed in juju-core:
milestone: none → 1.25.7
assignee: nobody → Menno Smits (menno.smits)
importance: Undecided → High
status: New → In Progress
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released