running parallel bootstraps in aws gives request limit exceeded

Bug #1888409 reported by Adam Stokes
This bug affects 7 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Ben Hoyt

Bug Description

In our CI we have been running multiple juju bootstraps/deploys in parallel on AWS, usually in batches of 10 at a time. Prior to Juju 2.8.1 this had been working fine; once we upgraded to 2.8.1 we started getting several AWS rate-limiting errors such as:

ERROR Request limit exceeded. (RequestLimitExceeded)

And during deploys

ERROR cannot deploy bundle: cannot deploy application "kubernetes-master": cannot add application "kubernetes-master": RequestLimitExceeded: Request limit exceeded.

My questions are:

a) Did something change between releases that would cause these limit errors to now be surfaced?
b) Does juju do any sort of automatic retry/recovery when it runs into these errors?

description: updated
Revision history for this message
Ian Booth (wallyworld) wrote :

A recent change is that Juju now queries the AWS API for available instance types and cost information, rather than relying on hard-coded data derived from a downloaded JSON file and baked into the juju binary.

The instance type info is cached so that it is only queried once by the controller per model (but it could also be queried by the juju client to validate constraints).

It's possible this is causing the rate limit exceeded issues. Juju doesn't explicitly handle the retries - it relies on the underlying cloud API library to do it under the covers, which is the case for OpenStack for example (it's specific to each library how to interpret and react to such rate-limiting responses). We'd need to look at what needs to be done to support retry/backoff for AWS.

Changed in juju:
milestone: none → 2.8-next
importance: Undecided → High
status: New → Triaged
Revision history for this message
Adam Stokes (adam-stokes) wrote :

As a workaround I could keep a controller around on each Jenkins node - would that let me do parallel juju deploys without hitting this rate limit error?

Revision history for this message
Ian Booth (wallyworld) wrote :

A quick investigation of the API requests shows that 2.8.1 makes fewer requests than 2.8.0 (at least when I bootstrapped).

In 2.8.1, the initial instance type info query is only done once (to a public API, not the same endpoint used to start instances etc.).
2.8.0 does make more calls to fetch the network info from an instance.

So more investigation is needed to understand why 2.8.1 gives the rate limit error.

2.8.1

/?Action=DescribeAccountAttributes
/?Action=DescribeVpcs
/?Action=DescribeAvailabilityZones
/?Action=CreateSecurityGroup
/?Action=CreateTags
/?Action=AuthorizeSecurityGroupIngress
/?Action=CreateSecurityGroup
/?Action=CreateTags
/?Action=RunInstances
/?Action=CreateTags
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=CreateTags
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=DescribeInstances

2.8.0

/?Action=DescribeAccountAttributes
/?Action=DescribeVpcs
/?Action=DescribeAvailabilityZones
/?Action=CreateSecurityGroup
/?Action=CreateTags
/?Action=AuthorizeSecurityGroupIngress
/?Action=CreateSecurityGroup
/?Action=CreateTags
/?Action=RunInstances
/?Action=CreateTags
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=CreateTags
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=DescribeInstances

Revision history for this message
Ian Booth (wallyworld) wrote :

Using 2.8.1, running 10 concurrent bootstraps was not enough to get it to fail for me.
I had to go up to 20 concurrent bootstraps to get a single controller failure. The throttling does appear to be per region per account which could explain the difference in what it takes to trigger the issue.

The failed API call was DescribeInstances, which we call a lot more often in 2.8.0. I can't see that this is a regression per se, but it is an issue nonetheless.

The http response was:

HTTP/1.1 503 Service Unavailable
Connection: close
Transfer-Encoding: chunked
Date: Mon, 10 Aug 2020 03:44:02 GMT
Server: AmazonEC2

<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>RequestLimitExceeded</Code><Message>Request limit exceeded.</Message></Error></Errors><RequestID>21cc3da1-6490-48ab-8b83-7af85d388225</RequestID></Response>

Sadly there's nothing in the error response to hint how long to wait before retrying, like there is with OpenStack.

We're still using a really old (6 years) Go SDK for AWS - https://github.com/go-amz/amz/tree/v3

There's a newer official SDK https://github.com/aws/aws-sdk-go which has the ability to plug in a retry mechanism. Not sure how viable this would be to get into a 2.8 release.

Revision history for this message
Ian Booth (wallyworld) wrote :

I have added a retry backoff to the "legacy" go-amz library and can bootstrap 20 controllers in parallel. I can see from logs that out of the entire 20 bootstraps maybe 3 API calls need to be retried, and they succeed for me on the first retry.
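
In case it's useful to see what that means in practice, below is a minimal Go sketch of the idea. To be clear, this is not the actual go-amz patch -- the attempt count, delays and the string match on the error are all illustrative only.

package main

import (
	"errors"
	"strings"
	"time"
)

// withRetry is an illustrative retry/backoff wrapper (not the real go-amz
// change): it retries one EC2 call with exponential backoff whenever AWS
// reports RequestLimitExceeded, and gives up after maxAttempts.
func withRetry(call func() error) error {
	const maxAttempts = 10
	delay := 500 * time.Millisecond

	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err = call()
		// Only a throttling error is worth retrying; anything else
		// (including success) is returned straight away.
		if err == nil || !strings.Contains(err.Error(), "RequestLimitExceeded") {
			return err
		}
		time.Sleep(delay)
		delay *= 2 // exponential backoff between attempts
	}
	return errors.New("request limit exceeded after retries: " + err.Error())
}

func main() {
	// A real caller would wrap an EC2 request here, e.g. DescribeInstances.
	_ = withRetry(func() error { return nil })
}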

Changed in juju:
milestone: 2.8-next → 2.8.3
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Revision history for this message
Adam Stokes (adam-stokes) wrote :

Great news! One question: could you attempt to juju deploy charmed-kubernetes in parallel too? I am also seeing the request limit exceeded there, and additionally I am seeing limit exceeded errors when attempting to add security groups to newly added units. Adding and removing the affected unit then resolves that issue, but maybe this is all part of the same issue?

Revision history for this message
Adam Stokes (adam-stokes) wrote :

Just another note: I did switch back to the 2.7/stable channel and can run everything in parallel with no problems, and tests are running as expected.

Revision history for this message
Ian Booth (wallyworld) wrote :

It's all part of the same issue - all such queries to the EC2 API go through the same funnel, so adding retry to that one place should be enough.

I deployed charmed-kubernetes in parallel on 2 controllers with no issues.

I bootstrapped 2.7 and these APIs were used:

/?Action=DescribeAccountAttributes
/?Action=DescribeVpcs
/?Action=DescribeAvailabilityZones
/?Action=CreateSecurityGroup
/?Action=CreateTags
/?Action=AuthorizeSecurityGroupIngress
/?Action=CreateSecurityGroup
/?Action=CreateTags
/?Action=RunInstances
/?Action=CreateTags
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=CreateTags
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=DescribeInstances
/?Action=DescribeInstances

So no real difference to 2.8.1.
It is a mystery to me why you are seeing such a difference between 2.7 and 2.8.

Ian Booth (wallyworld)
Changed in juju:
milestone: 2.8.3 → 2.8.2
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Cory Johns (johnsca) wrote :

It would be helpful to see a comparison of the API requests made to deploy CK or another non-trivial bundle on 2.7 vs 2.8, since a bootstrap only creates one machine while a deploy creates many and may make different requests than the bootstrap would.

Revision history for this message
Adam Stokes (adam-stokes) wrote :

Even with 2.8.2 from edge we're still hitting the rate limit exceeded error during deployments:

10:08:55 [validate-ck-amd64-xenial-1.18-edge] - add unit etcd/0 to new machine 1
10:09:04 [validate-ck-amd64-xenial-1.18-edge] ERROR cannot deploy bundle: cannot add unit for application "etcd": cannot assign unit "etcd/0" to machine: cannot assign unit "etcd/0" to new machine or container: cannot assign unit "etcd/0" to new machine: RequestLimitExceeded: Request limit exceeded.

Changed in juju:
status: Fix Committed → New
Revision history for this message
Adam Stokes (adam-stokes) wrote :

I put together a quick test script that shows what I see which basically mimics what we do in CI:

#!/bin/bash

set -x

uuid=$(uuidgen | tr '[:upper:]' '[:lower:]' | cut -f1 -d-)
juju bootstrap aws/us-east-1 controller-"$uuid" \
     --bootstrap-series focal \
     --force \
     --bootstrap-constraints arch="amd64" \
     --model-default test-mode=true \
     --model-default image-stream=daily \
     --model-default automatically-retry-hooks=false \
     --model-default logging-config="<root>=DEBUG"

juju deploy -m "controller-$uuid:default" \
     --force \
     --channel "edge" "cs:~containers/charmed-kubernetes"

juju-wait -e "controller-$uuid:default" -w

Put that in a file and make sure you have GNU parallel installed and run with:

> parallel --ungroup 'bash repro-juju.bash' ::: {1..10}

You'll start to see the request limit exceeded error after the second or third deploy. On my system this runs with a max of 4 CPUs, so 4 jobs in parallel, which should reproduce the same result I'm seeing when run on your system.

Revision history for this message
Ben Hoyt (benhoyt) wrote :

Hi Adam -- just an FYI that we've made some progress on this: I've been able to repro locally using your bash script with "parallel" (though I have 8 cores so it bootstrapped 8 in parallel). I saw a bunch of RequestLimitExceeded logs, which seems to be the same issue you are seeing.

We suspect this is happening because there are two AWS libraries being used in Juju: the legacy one for most EC2 calls, but the new AWS SDK for bootstrapping. A lot of AWS API calls are made during bootstrapping (and multiplied by the # of parallel bootstraps) to determine instance types / costs -- on the order of 300 requests in a few seconds when bootstrapping 8 controllers.

Ian added request retries to the legacy library. The AWS SDK has retries, but I believe they're not turned on by default, so all the "instance type" requests that happen on startup don't have retrying / exponential backoff enabled.

I'm going to spend some more time on this tomorrow to confirm the above and (hopefully) get in a fix that enables retrying for calls made from the AWS SDK library too.
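
For reference, one way to do this with aws-sdk-go is to plug a retryer into the session config. The Go sketch below is only an illustration of the idea, not the exact change that will land in Juju, and the delay values are made up:

package main

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// newEC2Client builds an EC2 client whose requests are retried with the
// SDK's DefaultRetryer, but with more attempts and longer throttle delays
// than the default of 3 retries (values here are illustrative).
func newEC2Client(region string) *ec2.EC2 {
	sess := session.Must(session.NewSession(&aws.Config{
		Region: aws.String(region),
		Retryer: client.DefaultRetryer{
			NumMaxRetries:    10,
			MinThrottleDelay: 1 * time.Second,
			MaxThrottleDelay: 30 * time.Second,
		},
	}))
	return ec2.New(sess)
}

func main() {
	svc := newEC2Client("us-east-1")
	_ = svc // the client is then used for DescribeInstances, RunInstances, etc.
}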

Ben Hoyt (benhoyt)
Changed in juju:
milestone: 2.8.2 → 2.8.3
status: New → Fix Committed
Revision history for this message
Ben Hoyt (benhoyt) wrote :

I just committed a fix for this to the 2.8 branch (should go into 2.8.3, but will be on 2.8/edge shortly): https://github.com/juju/juju/pull/11975

It does two things:

1) Bumping up the number of retries for the AWS SDK from 3 short retries to 10 longish ones (similar to the retries we do with the legacy amz library). This fixes the problem with parallel deploys (I've tested locally).
2) Reducing the number of AWS API calls by 36%, by calling DescribeInstanceTypeOfferings with a page size of 1000 instead of 100 (sketched below).
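
Roughly, the larger page size looks like this with aws-sdk-go (reusing the kind of EC2 client and imports from the earlier sketch; the helper name here is mine, not Juju's):

// listOfferings pages through instance type offerings 1000 at a time
// rather than 100, cutting the number of round trips to the API.
func listOfferings(svc *ec2.EC2) ([]*ec2.InstanceTypeOffering, error) {
	input := &ec2.DescribeInstanceTypeOfferingsInput{
		LocationType: aws.String("availability-zone"),
		MaxResults:   aws.Int64(1000),
	}
	var offerings []*ec2.InstanceTypeOffering
	err := svc.DescribeInstanceTypeOfferingsPages(input,
		func(page *ec2.DescribeInstanceTypeOfferingsOutput, lastPage bool) bool {
			offerings = append(offerings, page.InstanceTypeOfferings...)
			return true // keep going until the last page
		})
	return offerings, err
}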

Revision history for this message
Ben Hoyt (benhoyt) wrote :

Separately -- independent of fixing this issue -- I'm going to investigate further reducing the number of AWS API calls we make. Even with this fix we do one set of "instance types" calls on the client per deploy (~25), and then one set per add-application call on the controller (in this case ~25*35=875 per controller).

John A Meinel (jameinel)
Changed in juju:
assignee: Ian Booth (wallyworld) → Ben Hoyt (benhoyt)
Revision history for this message
Adam Stokes (adam-stokes) wrote :

I can also verify that 2.8/edge works as well! Thank you!!

Ben Hoyt (benhoyt)
summary: - running parallel bootstraps in aws gives request limit execeeded
+ running parallel bootstraps in aws gives request limit exceeded
Revision history for this message
Ben Hoyt (benhoyt) wrote :

Just for the record, we also committed two additional PRs to reduce the number of API calls made for fetching instance information:

* https://github.com/juju/juju/pull/11982, which reduces the number of API calls made per "get instance types" call from 25 to 20
* https://github.com/juju/juju/pull/11988, which only makes two "get instance types" calls per model (instead of 35, for this test model!) -- this is a huge reduction

Changed in juju:
status: Fix Committed → Fix Released