juju-core

juju does not retry provisioning against transient provider errors

Bug #1227450 reported by Kapil Thangavelu on 2013-09-19

This bug affects 10 people

Affects	Status	Importance	Assigned to	Milestone
juju-core	Fix Released	Medium	Unassigned

Bug Description

Say I've exceeded some limit against my provider. Juju will fail at provisioning. I then increase my limit at the provider, or remove unused resources, so the condition was effectively just a transient error. Juju does not attempt to provision the resource again: the resource is in a permanent error state from juju's perspective, even though a retry would succeed.

Related branches

lp:~wallyworld/juju-core/provisioner-retry

Merged into lp:~go-bot/juju-core/trunk at revision 2486

Juju Engineering: Pending requested 2014-03-26

lp:~wallyworld/juju-core/machineswithtransienterrors-api

Merged into lp:~go-bot/juju-core/trunk at revision 2483

Juju Engineering: Pending requested 2014-03-26

lp:~wallyworld/juju-core/retryprovisioning-command

Merged into lp:~go-bot/juju-core/trunk at revision 2489

Juju Engineering: Pending requested 2014-03-26

Revision history for this message

John A Meinel (jameinel) wrote on 2013-09-19:

I believe the intent was that we could add a "juju resolved" command that would let you indicate to Juju that the error has been resolved and Juju should try again.

Changed in juju-core:
importance:	Undecided → Medium
status:	New → Triaged

Curtis Hovey (sinzui) on 2013-10-12

Changed in juju-core:
importance:	Medium → Low

Kapil Thangavelu (hazmat) wrote on 2013-10-12:

That assumes it wasn't some transient error on the provider end that
was out of the user's care or control, i.e. the cloud just barfed. Does the
user need to be responsible for that? We already retry AWS operations
because of eventual consistency, which is effectively a guard against
transient conditions at a provider, so it seems sane to extend that.

David Britton (dpb) wrote on 2014-02-20:

I get this more and more now, it seems:

  "2":
      agent-state-info: '(error: cannot set up groups: Request limit exceeded. (RequestLimitExceeded))'
      instance-id: pending
      series: precise

This seems like a good first place to put in retries with backoff, since this error is the definition of transient.
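
A minimal sketch of what "retries with backoff" could look like, written in Python for illustration only. It is not juju's implementation: `TransientProviderError`, `retry_with_backoff`, and all parameter names are hypothetical, and deciding which provider errors count as transient (for example `RequestLimitExceeded`) is assumed to happen elsewhere.

```python
import random
import time


class TransientProviderError(Exception):
    """Hypothetical marker for errors worth retrying, e.g. RequestLimitExceeded."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is exhausted.

    Only TransientProviderError is retried; any other exception propagates
    immediately so that permanent failures are not hidden.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientProviderError:
            if attempt == max_attempts:
                raise  # give up and surface the last transient error
            # Exponential backoff capped at max_delay, with random jitter so
            # that many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

The delay doubles after each failed attempt, which lowers the request rate instead of hammering a provider that is already rejecting requests. The `sleep` parameter exists only so the function can be exercised without actually waiting.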

Dean Henrichsmeyer (dean) on 2014-02-20

tags:	added: landscape

Peter Petrakis (peter-petrakis) wrote on 2014-03-04:

I'm getting this on a repeatable basis in EC2 while trying to deploy a small 3-node Hadoop cluster using juju-deployer.

Dropping deployer isn't an option.

machines:
  "0":
    agent-state: started
    agent-version: 1.17.4
    dns-name: ec2-50-18-247-146.us-west-1.compute.amazonaws.com
    instance-id: i-2091057c
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M
  "1":
    agent-state-info: '(error: cannot set up groups: Request limit exceeded. (RequestLimitExceeded))'
    instance-id: pending
    series: precise
  "2":
    agent-state: started
    agent-version: 1.17.4
    dns-name: ec2-54-219-107-61.us-west-1.compute.amazonaws.com
    instance-id: i-819e0add
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M
  "3":
    agent-state: started
    agent-version: 1.17.4
    dns-name: ec2-204-236-184-129.us-west-1.compute.amazonaws.com
    instance-id: i-ad9105f1
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M
  "4":
    agent-state: started
    agent-version: 1.17.4
    dns-name: ec2-50-18-99-36.us-west-1.compute.amazonaws.com
    instance-id: i-ac9105f0
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M
  "5":
    agent-state: started
    agent-version: 1.17.4
    dns-name: ec2-54-219-226-138.us-west-1.compute.amazonaws.com
    instance-id: i-58930704
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M

Kapil Thangavelu (hazmat) wrote on 2014-03-04:

This has been reported half a dozen times with inconsistent importance levels; marking this one High as it captures the problem generically.

Changed in juju-core:
importance:	Low → High

Curtis Hovey (sinzui) on 2014-03-06

Changed in juju-core:
milestone:	none → 1.18.0
tags:	added: charmers

Curtis Hovey (sinzui) on 2014-03-06

Changed in juju-core:
milestone:	1.18.0 → 1.17.5

Mark Ramm (mark-ramm) wrote on 2014-03-10:

I think this is important, but we *can* release 1.17.5 without it. We *should* however tackle this as soon as possible.

Changed in juju-core:
milestone:	1.17.5 → 1.18.0

Ian Booth (wallyworld) on 2014-03-24

Changed in juju-core:
assignee:	nobody → Ian Booth (wallyworld)
status:	Triaged → In Progress

Ian Booth (wallyworld) wrote on 2014-03-26:

For now, there's a manual command that can be used to trigger a retry of the provisioning:

juju retry-provisioning <machine> [...]

The arguments are a space-separated list of machine ids.

The command can be used where, for example, we hit the rate limit exceeded error. The output of juju status will show which machine ids are affected.
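
As a rough illustration of picking the affected machine ids out of the status output, here is a small Python sketch. It assumes the status has already been loaded into a dict with the same shape as the output quoted earlier in this report (a `machines` mapping whose failed entries have `instance-id: pending` and an `agent-state-info` containing an error). The function names are hypothetical, and the matching rule is a guess based on those examples rather than anything juju guarantees.

```python
def machines_with_provisioning_errors(status):
    """Return ids of machines that look like failed provisioning attempts.

    A machine is selected when it has no real instance yet
    (instance-id is "pending") and its agent-state-info mentions an error.
    """
    failed = []
    for machine_id, info in status.get("machines", {}).items():
        pending = info.get("instance-id") == "pending"
        errored = "error" in info.get("agent-state-info", "")
        if pending and errored:
            failed.append(machine_id)
    return failed


def retry_provisioning_command(status):
    """Build the argument list for `juju retry-provisioning`, or None."""
    ids = machines_with_provisioning_errors(status)
    if not ids:
        return None
    return ["juju", "retry-provisioning"] + ids
```

For the six-machine status shown above, this would select only machine "1" and produce the equivalent of `juju retry-provisioning 1`.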

Ian Booth (wallyworld) on 2014-03-26

Changed in juju-core:
milestone:	1.18.0 → 1.17.7
status:	In Progress → Fix Committed

Kapil Thangavelu (hazmat) wrote on 2014-03-27:

It's a fine stopgap, but this is a poor scale-out solution.

Ian Booth (wallyworld) wrote on 2014-03-27:

No argument there at all. But we had a very short time to get *something* done before releasing 1.18, and it at least allows manual intervention where required. This is not the end of it: more will be done, but there are a lot of caveats to consider and we need to make sure we don't make the problem worse when retrying. Hence the cautious, incremental approach.

Andreas Hasenack (ahasenack) wrote on 2014-03-27:

#10

On Mar 27, 2014 2:25 AM, "Ian Booth" <email address hidden> wrote:
>
> No argument there at all. But we had a very short time to get
> *something* done before releasing 1.18 and it at least allows manual
> intervention where required. This is not the end of it - more will be
> done but there's a lot of caveats to consider and we need to make sure
> we don't make the problem worse when retrying. Hence the cautious,
> incremental approach.

How about being less aggressive towards the service? Something changed in
this area in 1.17.x compared to 1.16.x. I just can't use 1.17.x with AWS,
for example; this is triggered every time. Retrying won't cut down the
aggressiveness, and might even get one banned, I figure.

Would this be better served as another bug?

Ian Booth (wallyworld) wrote on 2014-03-27:

#11

1.17 introduced a feature to poll instances for public address information. Unfortunately, this was done one instance at a time. The next Juju release, coming within a day or so (1.17.7), addresses this issue by polling instances in bulk, which will significantly cut down on the number of requests and should alleviate the problem. Having said that, other factors may still result in transient provisioning errors, so the retry feature is still needed.
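
To make the difference in request volume concrete, here is a small Python sketch comparing the two polling styles. `FakeProvider` and both polling functions are hypothetical stand-ins, not juju or EC2 code; the only assumption is that the provider can describe several instances in a single request.

```python
class FakeProvider:
    """Hypothetical stand-in for a cloud API that counts requests made."""

    def __init__(self, addresses):
        self._addresses = addresses
        self.request_count = 0

    def describe_instances(self, instance_ids):
        # One API request, regardless of how many ids are passed.
        self.request_count += 1
        return {i: self._addresses.get(i) for i in instance_ids}


def poll_one_at_a_time(provider, instance_ids):
    """Issue one request per instance (the costly pattern)."""
    result = {}
    for instance_id in instance_ids:
        result.update(provider.describe_instances([instance_id]))
    return result


def poll_in_bulk(provider, instance_ids, batch_size=100):
    """Issue one request per batch of up to batch_size instances."""
    result = {}
    ids = list(instance_ids)
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        result.update(provider.describe_instances(batch))
    return result
```

With N instances, the first style costs N requests per polling round, while the second costs roughly N divided by `batch_size`, rounded up. Both return the same address mapping.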

Andreas Hasenack (ahasenack) wrote on 2014-03-27:

#12

On Thu, Mar 27, 2014 at 8:49 AM, Ian Booth <email address hidden> wrote:
> 1.17 introduced a feature to poll instances for public address
> information. Unfortunately, this was done one instance at a time. The
> next Juju release coming within a day or so (1.17.7) addresses this
> issue by bulk polling instances which will significantly cut down on the
> number of requests and should alleviate the problem. Having said that,

\o/ nice to hear that :)

William Reade (fwereade) wrote on 2014-03-27:

#13

Reopened; created https://bugs.launchpad.net/juju-core/+bug/1298435 to track what *was* done.

Changed in juju-core:
status:	Fix Committed → Triaged
milestone:	1.17.7 → 2.0

Curtis Hovey (sinzui) on 2014-05-12

Changed in juju-core:
assignee:	Ian Booth (wallyworld) → nobody

Curtis Hovey (sinzui) on 2014-05-12

Changed in juju-core:
milestone:	none → next-stable

Curtis Hovey (sinzui) on 2014-05-12

tags:	added: reliability

Curtis Hovey (sinzui) on 2014-12-03

Changed in juju-core:
milestone:	1.21 → 1.22

Curtis Hovey (sinzui) on 2015-01-07

Changed in juju-core:
milestone:	1.22 → 1.23

Curtis Hovey (sinzui) on 2015-02-10

Changed in juju-core:
milestone:	1.23 → none
importance:	High → Medium

Jorge Castro (jorge) on 2015-08-13

tags:	added: adoption

Anastasia (anastasia-macmood) wrote on 2016-10-19:

#14

There have been a lot of changes in provisioning and in retrying provisioning since this bug was filed.
Both Juju 1.25 and Juju 2 now behave very differently.

We believe that this failure has been fixed as part of the re-work.

Changed in juju-core:
status:	Triaged → Fix Released
