juju does not retry provisioning against transient provider errors

Bug #1227450 reported by Kapil Thangavelu
This bug affects 10 people
Affects     Status         Importance   Assigned to   Milestone
juju-core   Fix Released   Medium       Unassigned

Bug Description

Say I've exceeded some limit against my provider. Juju will fail at provisioning. I then go to increase my limit at the provider, or remove unused resources. The condition was effectively just a transient error. Juju does not attempt to provision the resource again; the resource is in a permanent error state from Juju's perspective, even though a retry would succeed.

John A Meinel (jameinel) wrote :

I believe the intent was that we could add a "juju resolved" command that would let you indicate to Juju that the error has been resolved and Juju should try again.

Changed in juju-core:
importance: Undecided → Medium
status: New → Triaged
Curtis Hovey (sinzui)
Changed in juju-core:
importance: Medium → Low
Kapil Thangavelu (hazmat) wrote :

Does that assume it wasn't some transient error on the provider end that
was out of the user's care or control, i.e. the cloud just barfed? Does the
user need to be responsible for that? We already do eventually consistent
AWS ops because of eventual consistency, which are effectively guards
against transient conditions at a provider; it seems sane to extend them.

On Thu, Sep 19, 2013 at 5:39 AM, John A Meinel <email address hidden>wrote:

> I believe the intent was that we could add a "juju resolved" command
> that would let you indicate to Juju that the error has been resolved and
> Juju should try again.

David Britton (dpb) wrote :

I get this more and more now, it seems:

  "2":
    agent-state-info: '(error: cannot set up groups: Request limit exceeded. (RequestLimitExceeded))'
    instance-id: pending
    series: precise

This seems like a good first place to put in retries with backoff, since this error is the definition of transient.
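
A minimal sketch of what retries with backoff could look like, in Python for illustration only. It is not Juju's implementation: `TransientProviderError`, `retry_with_backoff`, and all of the parameter defaults are hypothetical names and values chosen for this example, and deciding which provider errors count as transient (for instance RequestLimitExceeded) is left to the caller.

```python
import random
import time


class TransientProviderError(Exception):
    """Hypothetical marker for provider errors worth retrying,
    such as RequestLimitExceeded."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    The wait doubles after each failed attempt (capped at max_delay) and
    gets up to 50% random jitter so that many clients do not retry in
    lockstep. Errors other than TransientProviderError propagate at once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientProviderError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

The cap on attempts matters here: retrying without a limit against a rate-limited API would add to the load that caused the error in the first place.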

tags: added: landscape
Peter Petrakis (peter-petrakis) wrote :

I'm getting this on a repeatable basis in EC2 while trying to deploy a small 3-node Hadoop cluster using juju-deployer.

Dropping deployer isn't an option.

machines:
  "0":
    agent-state: started
    agent-version: 1.17.4
    dns-name: ec2-50-18-247-146.us-west-1.compute.amazonaws.com
    instance-id: i-2091057c
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M
  "1":
    agent-state-info: '(error: cannot set up groups: Request limit exceeded. (RequestLimitExceeded))'
    instance-id: pending
    series: precise
  "2":
    agent-state: started
    agent-version: 1.17.4
    dns-name: ec2-54-219-107-61.us-west-1.compute.amazonaws.com
    instance-id: i-819e0add
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M
  "3":
    agent-state: started
    agent-version: 1.17.4
    dns-name: ec2-204-236-184-129.us-west-1.compute.amazonaws.com
    instance-id: i-ad9105f1
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M
  "4":
    agent-state: started
    agent-version: 1.17.4
    dns-name: ec2-50-18-99-36.us-west-1.compute.amazonaws.com
    instance-id: i-ac9105f0
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M
  "5":
    agent-state: started
    agent-version: 1.17.4
    dns-name: ec2-54-219-226-138.us-west-1.compute.amazonaws.com
    instance-id: i-58930704
    instance-state: running
    series: precise
    hardware: arch=amd64 cpu-cores=1 cpu-power=100 mem=1740M root-disk=8192M

Kapil Thangavelu (hazmat) wrote :

This has been reported a half-dozen times with inconsistent importance levels; marking it High, as it captures the problem generically.

Changed in juju-core:
importance: Low → High
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → 1.18.0
tags: added: charmers
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.18.0 → 1.17.5
Mark Ramm (mark-ramm) wrote :

I think this is important, but we *can* release 1.17.5 without it. We *should* however tackle this as soon as possible.

Changed in juju-core:
milestone: 1.17.5 → 1.18.0
Ian Booth (wallyworld)
Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Ian Booth (wallyworld) wrote :

For now, there's a manual command that can be used to trigger a retry of the provisioning:

juju retry-provisioning <machine> [...]

The arguments are a space-separated list of machine ids.

The command can be used where, for example, we hit the rate limit exceeded error. The output of juju status will show which machine ids are affected.
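
As a rough illustration of picking the affected machine ids out of the status output, the sketch below assumes the `juju status` YAML has already been loaded into a Python dict shaped like the listings earlier in this report. The helper names `machines_in_error` and `retry_command` are hypothetical, and matching on the `(error:` prefix of `agent-state-info` is an assumption based on those listings rather than a documented contract.

```python
def machines_in_error(status):
    """Return the ids of machines whose agent-state-info reports an error."""
    failed = []
    for machine_id, info in status.get("machines", {}).items():
        state_info = str(info.get("agent-state-info", ""))
        if state_info.startswith("(error:"):
            failed.append(machine_id)
    return failed


def retry_command(status):
    """Build the argument list for 'juju retry-provisioning <machine> [...]'.

    Returns an empty list when no machine is in an error state.
    """
    ids = machines_in_error(status)
    return ["juju", "retry-provisioning", *ids] if ids else []


# Status shaped like the EC2 report above: only machine "1" failed.
status = {
    "machines": {
        "0": {"agent-state": "started", "instance-id": "i-2091057c"},
        "1": {
            "agent-state-info": "(error: cannot set up groups: "
                                "Request limit exceeded. (RequestLimitExceeded))",
            "instance-id": "pending",
        },
    }
}
print(" ".join(retry_command(status)))
```

The resulting list can be passed to `subprocess.run` or joined and run by hand.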

Ian Booth (wallyworld)
Changed in juju-core:
milestone: 1.18.0 → 1.17.7
status: In Progress → Fix Committed
Kapil Thangavelu (hazmat) wrote :

It's a fine stopgap, but this is a poor scale-out solution.

On Wed, Mar 26, 2014 at 5:21 PM, Ian Booth <email address hidden> wrote:

> ** Changed in: juju-core
> Milestone: 1.18.0 => 1.17.7
>
> ** Changed in: juju-core
> Status: In Progress => Fix Committed

Ian Booth (wallyworld) wrote :

No argument there at all. But we had a very short time to get *something* done before releasing 1.18, and it at least allows manual intervention where required. This is not the end of it - more will be done, but there are a lot of caveats to consider and we need to make sure we don't make the problem worse when retrying. Hence the cautious, incremental approach.

Andreas Hasenack (ahasenack) wrote :

On Mar 27, 2014 2:25 AM, "Ian Booth" <email address hidden> wrote:
>
> No argument there at all. But we had a very short time to get
> *something* done before releasing 1.18 and it at least allows manual
> intervention where required. This is not the end of it - more will be
> done but there are a lot of caveats to consider and we need to make sure
> we don't make the problem worse when retrying. Hence the cautious,
> incremental approach.

How about being less aggressive towards the service? Something changed in
this area in 1.17.x when compared to 1.16.x. I just can't use 1.17.x with
AWS, for example; this is triggered every time. Retrying won't cut down the
aggressiveness, and might even get one banned, I figure.

Would this be better served as another bug?

Ian Booth (wallyworld) wrote :

1.17 introduced a feature to poll instances for public address information. Unfortunately, this was done one instance at a time. The next Juju release coming within a day or so (1.17.7) addresses this issue by bulk polling instances which will significantly cut down on the number of requests and should alleviate the problem. Having said that, other factors may still result in transient provisioning errors so the feature is still needed.
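
To illustrate why bulk polling cuts down the request count, here is a small Python sketch. It is not Juju's code: `describe` stands in for a hypothetical provider call that takes a list of instance ids and returns a dict of addresses, and `batch_size` is an arbitrary value chosen for the example.

```python
def poll_individually(instance_ids, describe):
    """One provider request per instance."""
    addresses = {}
    for instance_id in instance_ids:
        addresses.update(describe([instance_id]))
    return addresses


def poll_in_bulk(instance_ids, describe, batch_size=100):
    """One provider request per batch of up to batch_size instances."""
    addresses = {}
    for start in range(0, len(instance_ids), batch_size):
        batch = instance_ids[start:start + batch_size]
        addresses.update(describe(batch))
    return addresses
```

With N instances, the first function issues N requests per polling round, while the second issues N divided by `batch_size`, rounded up, which is where the reduction in API traffic comes from.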

Andreas Hasenack (ahasenack) wrote :

On Thu, Mar 27, 2014 at 8:49 AM, Ian Booth <email address hidden> wrote:
> 1.17 introduced a feature to poll instances for public address
> information. Unfortunately, this was done one instance at a time. The
> next Juju release coming within a day or so (1.17.7) addresses this
> issue by bulk polling instances which will significantly cut down on the
> number of requests and should alleviate the problem. Having said that,

 \o/ nice to hear that :)

William Reade (fwereade) wrote :

Reopened; created https://bugs.launchpad.net/juju-core/+bug/1298435 to track what *was* done.

Changed in juju-core:
status: Fix Committed → Triaged
milestone: 1.17.7 → 2.0
Curtis Hovey (sinzui)
Changed in juju-core:
assignee: Ian Booth (wallyworld) → nobody
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: none → next-stable
Curtis Hovey (sinzui)
tags: added: reliability
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.21 → 1.22
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.22 → 1.23
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.23 → none
importance: High → Medium
Jorge Castro (jorge)
tags: added: adoption
Anastasia (anastasia-macmood) wrote :

There have been a lot of changes in provisioning and in retrying provisioning since this bug was filed.
Both Juju 1.25 and Juju 2 now behave very differently.

We believe that this failure has been fixed as part of the re-work.

Changed in juju-core:
status: Triaged → Fix Released