ERROR cannot deploy bundle: cannot deploy application: i/o timeout

Bug #1597601 reported by Jason Hobbs
This bug affects 3 people
Affects         Status        Importance  Assigned to         Milestone
Canonical Juju  Fix Released  High        Menno Finlay-Smits
juju-core       Fix Released  High        Menno Finlay-Smits

Bug Description

Running this command:
juju-2.0 deploy -m maas:default artifacts/oil_deployment

I got this output:
added charm cs:trusty/cinder-253
ERROR cannot deploy bundle: cannot deploy application "cinder": cannot add application "cinder": read tcp 127.0.0.1:54084->127.0.0.1:37017: i/o timeout

I've attached the machine-0 log.

From this build in OIL: http://10.245.162.43:8080/job/pipeline_deploy/6818/

This is with 2.0-beta10 and 2.0-beta11.

We hit this a couple of times a day in OIL.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
description: updated
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Jason, can you also attach /var/log/syslog from the controller machine?

Changed in juju-core:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I recreated this, and have attached /var/log from the controller machine.

Changed in juju-core:
status: Incomplete → New
description: updated
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Jason - would you be able to provide the bundle being deployed to see if we can recreate?

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.0-beta13
tags: added: bundles
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Cheryl,

Here is the bundle that was used in the deployment I captured the bootstrap node logs from.

http://paste.ubuntu.com/19286457/

It doesn't happen every time I deploy with this bundle - it's a sporadic failure, so you may not be able to reproduce it easily.

tags: added: oil-2.0
tags: added: 2.0
Curtis Hovey (sinzui)
tags: added: deploy repeatability
tags: added: retry
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta13 → 2.0-beta14
Changed in juju-core:
assignee: nobody → Menno Smits (menno.smits)
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I'm pretty sure what's happening is that "juju enable-ha" is being run at some point and then the deploy continues. As each new controller node created by enable-ha comes up and joins Juju's MongoDB replicaset, a MongoDB leader election is triggered, which causes each mongod to drop its connections (this is by design).

The dropped MongoDB connections cause the controller agent's workers to restart and reconnect, as they should, but if an API request comes in at just the right time it may be actioned just as the MongoDB connection goes down. This results in the i/o timeout error.

MongoDB's logs are in /var/log/syslog. Here's what mongod on machine-0 logged as a new replicaset member joined:

Jul 13 09:50:59 ubuntu mongod.37017[4593]: [ReplicationExecutor] transition to RECOVERING
Jul 13 09:50:59 ubuntu mongod.37017[4593]: [ReplicationExecutor] transition to SECONDARY
Jul 13 09:50:59 ubuntu mongod.37017[4593]: [ReplicationExecutor] conducting a dry run election to see if we could be elected
Jul 13 09:50:59 ubuntu mongod.37017[4593]: [ReplicationExecutor] dry election run succeeded, running for election
Jul 13 09:50:59 ubuntu mongod.37017[4593]: [ReplicationExecutor] election succeeded, assuming primary role in term 1
Jul 13 09:50:59 ubuntu mongod.37017[4593]: [ReplicationExecutor] transition to PRIMARY

The timestamps correspond to when connections and workers are shut down in the machine agent, and when the i/o timeout error occurred.

The immediate workaround is to wait for the new controller machines to come up after "juju enable-ha" is issued before continuing with the deployment (being aware that the i/o timeout might occur while status is polled).

Ideally, Juju would hide all this from the user but I'm not sure what the right approach should be. The fix is likely to be non-trivial and is unlikely to happen before 2.0. This problem isn't new. I'll take this to the juju-dev list for comment.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1597601] Re: ERROR cannot deploy bundle: cannot deploy application: i/o timeout

In OIL, we're never running enable-ha (unless this happens implicitly by doing a bootstrap/deploy) and we hit this regularly.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

OK, that's good to know. It looks like MongoDB will do these leader elections even when there's only one member. That rings a bell now that I think about it. It happens soon after MongoDB has come up.

I'm bringing this topic up with the Juju tech board tonight.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

@jason-hobbs: It's looking like it's not going to be feasible to prevent an error from bubbling up to the client when MongoDB performs a leader election. What we can do is translate that error into a clear, easily identifiable error which indicates that the request should be retried. Would this be acceptable to you?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Menno,

Why can't juju retry if it's retryable? This sounds like a condition internal to juju/mongodb, not something external about the environment I need to fix as an operator. I don't want to have to write a wrapper around juju's cli to do what juju should be doing itself, which is deploying what I asked for unless there is a fatal error caused by something in the environment that is beyond juju's control.

Jason

Revision history for this message
Richard Harding (rharding) wrote :

Menno, I have to agree with Jason. Can you clarify 'not feasible'? What I could see is that every write to mongo would have to be wrapped with a retry mechanism in order to make this completely invisible. That's an insane amount of work; however, it seems we could target specific operations that users perform and expect to work, such as deploying a bundle and the operations contained in it, or adding an application (where this appears to have failed), and at least start to set an example in the code.

I'd love to get more details as to the tech board's thoughts on how to prevent this longer term. If I'm mistaken, and the size of the exposed surface of the problem isn't what makes it not feasible, I'd be curious what that is and if there are other ways we can make a dent in the problem.

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Sorry, I wasn't clear. The operator wouldn't have to retry - the API client (in this case the juju CLI) would though. There would be no pain from the operator's perspective as this would be handled transparently but it would mean that every API client would need to be prepared to handle the "please retry" error.

@jason-hobbs: My question was more, "would it be acceptable for the API server to return a 'please retry' error?" but if you're not consuming the API directly then I guess it doesn't really matter to you. It will matter to the juju GUI, landscape teams etc. I'll take this to the juju-dev list.

Revision history for this message
Richard Harding (rharding) wrote :

Thanks for the clarification Menno. It still seems that there's some code in Juju that is talking to mongo as a remote data service that could perform a retry much as we would against a cloud api. If we cannot though, the next best thing is the client retrying for the user in a controlled fashion.

If/when we head down that path we need to establish a pattern that makes it clear where clients are expected to handle a retry, much like noting that a call can throw an error.

Changed in juju-core:
assignee: Menno Smits (menno.smits) → nobody
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

Rick: we do indeed talk to MongoDB over an API, but the machine agent's workers (including the API server) are all arranged in such a way that they are stopped when the connection to MongoDB dies. This provides a robust way for workers to reconfigure themselves when the MongoDB master changes, and also a way to "start from scratch" when MongoDB has had a problem.

Unfortunately, this behaviour can also bubble up to the client.

The Juju controllers used to use MongoDB's idea of a master to drive where certain controller-based workers ran. In this case it was particularly useful to have all workers which ultimately rely on a connection to MongoDB be restarted when the connection dropped due to a leader election. We're moving away from using the MongoDB master for this now, but I believe there are still one or two places where it matters.

I think that Will might have some useful input on all this so I'm going to poke him about it.

Changed in juju-core:
assignee: nobody → Alexis Bruemmer (alexis-bruemmer)
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I need to refocus on other things, so I'm summarising what I think needs to be done at a high level if we go with the "please retry" error approach (a rough sketch follows the list).

1. A new API error type needs to be implemented which indicates to the client that the last request should be retried.
2. The API server needs to translate MongoDB "i/o timeout" errors into this new retry error.
3. The client needs to be updated (hopefully in just one place) to transparently retry when the retry error is seen.
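
As an illustration only, here is a minimal Go sketch of the three steps above; the names (ErrTryAgain, IsTryAgain, callWithRetry), the attempt count and the backoff are invented for this sketch and are not Juju's actual error types or client code.

package retry

import (
	"errors"
	"fmt"
	"time"
)

// ErrTryAgain is a hypothetical error the API server would return when a
// request was interrupted by a transient problem (such as the MongoDB
// connection dropping) and should simply be retried by the client.
var ErrTryAgain = errors.New("temporary server problem, please retry")

// IsTryAgain reports whether err indicates the request should be retried.
func IsTryAgain(err error) bool {
	return errors.Is(err, ErrTryAgain)
}

// callWithRetry transparently retries an API call when the server asks for
// a retry, so the operator never sees the error.
func callWithRetry(call func() error) error {
	const attempts = 5
	var err error
	for i := 0; i < attempts; i++ {
		if err = call(); err == nil || !IsTryAgain(err) {
			return err
		}
		time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

The idea is that an API client such as the juju CLI would wrap each request in something like callWithRetry, so the "please retry" error never reaches the operator.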

I've assigned to Alexis so this ticket doesn't get forgotten.

Tim Penhey (thumper)
Changed in juju-core:
assignee: Alexis Bruemmer (alexis-bruemmer) → Tim Penhey (thumper)
status: Triaged → In Progress
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

This issue was discussed further at last night's tech board meeting. It was decided that returning a "please retry" error for the client to handle is a good immediate measure. It makes the situation better than it is now and the concept of such an error is useful to have available.

It was also decided that we should look to rearrange the API server so it doesn't have to restart when the connection to MongoDB dies. This will be a significant effort which we will track as a tech debt item (adding to the board now).

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta14 → 2.0-beta15
Revision history for this message
Tim Penhey (thumper) wrote :

Although I wasn't able to reproduce this problem, we think it is now fixed.

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We're hitting this in beta 15.

Changed in juju-core:
status: Fix Released → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here's the run in OIL we hit this on (internal IP): http://10.245.162.43:8080/job/pipeline_deploy/9178/console

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Logs from the bootstrap node from the failure on August 22nd.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Logs from another run (9153) on the 22nd that hit this bug.

Changed in juju-core:
status: New → Triaged
milestone: 2.0-beta15 → 2.0-beta16
Revision history for this message
Tim Penhey (thumper) wrote :

@Jason, is there a way we can get debug logs? The default output that we get in the machine-0.log is not good enough to help debug this.

Are you able to set the logging config on the controller?

Preferably 'juju=DEBUG'. You can get this by bootstrapping using the --debug option, or later by calling 'juju set-model-config logging-config=juju=DEBUG'

affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta16 → none
milestone: none → 2.0-beta16
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Yes Tim, we'll enable debug logging.

Jason

Revision history for this message
Larry Michel (lmic) wrote :

Here's from latest recreate with debug enabled.

Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta16 → 2.0-beta17
Revision history for this message
Tim Penhey (thumper) wrote :

OK, I think I have it this time. Proposing a fix.

Changed in juju:
status: Triaged → In Progress
Tim Penhey (thumper)
Changed in juju:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Larry Michel (lmic) wrote :

We were able to recreate with Beta18.

2016-09-15 06:28:01,642 [INFO] oil_ci.juju.juju2: deploying: juju-2.0 deploy -m ci-oil-slave9:default artifacts/oil_deployment
Deploying charm "cs:trusty/ceph-mon-3"
ERROR cannot deploy bundle: cannot deploy application "ceph-mon": cannot add application "ceph-mon": read tcp 10.244.241.42:41782->10.244.241.42:37017: i/o timeout

Log from bootstrap node attached.

Changed in juju:
status: Fix Released → New
Tim Penhey (thumper)
Changed in juju:
milestone: 2.0-beta17 → 2.0-rc1
Changed in juju:
assignee: Tim Penhey (thumper) → Menno Smits (menno.smits)
Changed in juju:
status: New → Triaged
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I can see what the problem is. The retry logic only kicks in if the entire API request returns a retry error, but since almost all of Juju's APIs accept multiple entities, it's possible for some parts of the API request to succeed while others fail. A failure for the work done in an API handler for a given entity is much more likely.

I have a fix in mind, but it is non-trivial and has a risk of introducing more problems. I've been working on that today.
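
To make the per-entity wrinkle concrete, here is a hedged Go sketch; the Result type and retryFailed helper are invented for illustration and do not match Juju's real bulk-API params structs. The point is that the client re-submits only the entities whose individual results carried a retry error, rather than retrying the whole request.

package bulk

import "errors"

// errTryAgain stands in for a hypothetical "please retry" error from the
// API server; the real error would live in the API layer.
var errTryAgain = errors.New("temporary server problem, please retry")

// Result is an illustrative stand-in for one entry of a bulk API response:
// one outcome per entity submitted in the request.
type Result struct {
	Entity string
	Err    error // nil on success
}

// retryFailed re-submits only the entities whose individual results asked
// for a retry, leaving entities that already succeeded (or failed
// permanently) alone.
func retryFailed(call func([]string) []Result, entities []string, maxAttempts int) map[string]Result {
	final := make(map[string]Result, len(entities))
	pending := entities
	for attempt := 0; attempt < maxAttempts && len(pending) > 0; attempt++ {
		var next []string
		for _, r := range call(pending) {
			if errors.Is(r.Err, errTryAgain) {
				next = append(next, r.Entity) // retry just this entity
			} else {
				final[r.Entity] = r // success or a permanent error
			}
		}
		pending = next
	}
	for _, e := range pending { // retries exhausted
		final[e] = Result{Entity: e, Err: errTryAgain}
	}
	return final
}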

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I've been unable to reproduce the problem, even when using the same bundle you are. The best I've been able to do is artificially simulate the issue by modifying Juju to randomly generate the appropriate error. This isn't quite the same though, which makes it hard to know whether any fix I come up with will work.

Jason and Larry: If I provide you with a custom Juju build, would it be possible for you to run your tests with it? I'd be a lot happier merging the fixes I'm thinking about knowing that they've worked for you without introducing other problems.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Menno - I can test with another build given instructions to do so - let me know!

Jason

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

I can now reproduce the problem more accurately, if still somewhat artificially.

In the latest log tarball I noticed that the mongodb logs in /var/log/syslog stopped for the period between 06:28:15 and 06:28:40. It's during this period that the machine-0 jujud logged tons of i/o timeout errors. It's almost like mongodb got stuck for a while or was heavily loaded for some reason (although I see no reason why this should be so).

This gave me the idea of pausing the mongod process using SIGSTOP. By doing this I'm able to cause jujud to experience the same i/o timeout errors. By pausing the mongod process at the right place inside one of Juju's API handlers I'm able to get the i/o timeout errors at the client as reported. I suspect that for some reason mongod is occasionally getting stuck/busy for a while and this is what's causing the i/o timeout errors in OIL. MongoDB leader elections (which typically happen soon after bootstrapping the controller) are certainly another cause, but that hasn't happened in the latest logs attached to this ticket.
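
For anyone wanting to repeat that reproduction, here is a small stand-alone Go helper along the lines described; the -pid and -for flags and the 45s default are arbitrary choices for this sketch, not anything Juju ships. Run it as root against the mongod PID.

// pause_proc.go: pause a process with SIGSTOP for a while, then resume it
// with SIGCONT, to simulate a hung mongod.
package main

import (
	"flag"
	"log"
	"syscall"
	"time"
)

func main() {
	pid := flag.Int("pid", 0, "PID of the process to pause (e.g. mongod)")
	pause := flag.Duration("for", 45*time.Second, "how long to keep it paused")
	flag.Parse()
	if *pid <= 0 {
		log.Fatal("a valid -pid is required")
	}

	log.Printf("sending SIGSTOP to %d", *pid)
	if err := syscall.Kill(*pid, syscall.SIGSTOP); err != nil {
		log.Fatalf("SIGSTOP failed: %v", err)
	}

	time.Sleep(*pause)

	log.Printf("sending SIGCONT to %d", *pid)
	if err := syscall.Kill(*pid, syscall.SIGCONT); err != nil {
		log.Fatalf("SIGCONT failed: %v", err)
	}
}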

Being able to do this will speed up the development of a fix/workaround. I'd still like it to be tested within the OIL setup once I have it ready though.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Hey Menno,

Sorry, I was confused about which bug this was - we won't be able to take non-released builds for this. We hit this running in our general pipeline - it takes many deployments to ever reproduce it - and we only run released builds in that.

Jason

Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

OK, well we'll have to rely on my simulated reproduction then.

Can you tell me a little more about the context where you see the problem? Specifically:

1. Which substrate(s) do you see the problem on? (i.e. MAAS, LXD, public clouds, all of the above)

2. How many of these tests do you run a day and how many of those see this problem?

3. Do you happen to have CPU, memory and/or disk utilisation data for the Juju controller host(s) when the problem happens? Is this something you could collect?

Changed in juju:
milestone: 2.0-rc1 → 2.0-rc2
Revision history for this message
Menno Finlay-Smits (menno.smits) wrote :

After battling against this for over 2 days straight without arriving at a workable fix, I've now gone for a more pragmatic approach: https://github.com/juju/juju/pull/6292

I've been able to show that this allows jujud to be resilient to mongodb hanging for up to 60s. The logs provided showed mongodb hanging for 30-40s.
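
As a rough illustration of that kind of resilience (not necessarily what the PR does; the sketch assumes the gopkg.in/mgo.v2 driver Juju used at the time and is only safe for idempotent operations), an operation could be retried against a refreshed session until a deadline instead of letting the first socket timeout propagate:

package resilient

import (
	"time"

	mgo "gopkg.in/mgo.v2"
)

// runWithDeadline keeps retrying op against the session until it succeeds
// or the deadline passes, refreshing the session after each failure so a
// dead socket (for example from a mongod that was hung and then recovered)
// is replaced rather than bubbling an i/o timeout up to the API client.
// Only use this for idempotent operations.
func runWithDeadline(session *mgo.Session, deadline time.Duration, op func(*mgo.Session) error) error {
	start := time.Now()
	for {
		err := op(session)
		if err == nil {
			return nil
		}
		if time.Since(start) > deadline {
			return err // mongod has been unresponsive for too long; give up
		}
		session.Refresh() // drop broken sockets and grab fresh ones
		time.Sleep(2 * time.Second)
	}
}

With a sketch like this, a mongod hang shorter than the deadline shows up as a slow request rather than an i/o timeout at the client.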

Changed in juju:
status: In Progress → Fix Committed
Changed in juju-core:
milestone: none → 1.25.7
assignee: nobody → Menno Smits (menno.smits)
importance: Undecided → High
status: New → In Progress
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released