uniter: high frequency relation operations cause rate limits to be exceeded

Bug #1244766 reported by Dave Cheney
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju-core
Fix Released
High
John A Meinel

Bug Description

A charm that performs relation operations at a high rate will cause too many calls to the provider and hit the provider's API rate limits.

To reproduce:

Deploy this charm: https://github.com/davecheney/charms/tree/master/saucy/xplod

juju deploy --repository=$HOME/charms local:saucy/xplod x1
juju deploy --repository=$HOME/charms local:saucy/xplod x2

juju add-relation x1:input x2:output
juju add-relation x2:input x1:output

Once x1 and x2 are deployed, juju debug-log will show the relation-changed hook firing on each unit every 5 seconds (the API server's watcher polling interval).

Now increase the number of units:

juju add-unit -n5 x1
juju add-unit -n5 x2

Then watch the log: the uniters will start to fail with errors from the API server related to provider rate limiting.

At this point it is quite difficult to recover, as your EC2 credentials are also used by the AWS console, which makes it hard to manually terminate the environment.

Why does the uniter make calls to the provider for this simple relation-changed hook?

% cat xplod/hooks/input-relation-changed
#!/bin/bash

set -xe

V=$(relation-get v)
V=${V:-0}
relation-set v=$(expr $V + 1)
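The hook above is enough to create an unbounded feedback loop: each unit's relation-set re-triggers the peer's relation-changed hook, which does the same in return. A minimal local simulation of that loop (the BAG temp file and run_hook function are illustrative stand-ins for the relation data bag and a hook firing, not juju tooling):

```shell
#!/bin/bash
# Local simulation of the xplod ping-pong. A temp file stands in for the
# relation data bag; run_hook mimics one relation-changed execution.
# None of this is juju API -- it only illustrates the feedback loop.
set -e
BAG=$(mktemp)
echo 0 > "$BAG"

run_hook() {
    V=$(cat "$BAG")              # relation-get v
    V=${V:-0}                    # default to 0 on first run
    echo $((V + 1)) > "$BAG"     # relation-set v=<V+1>
}

# In the real deployment each write re-triggers the peer's hook forever;
# here we cap it at 10 rounds (two hook runs per round).
for round in $(seq 10); do
    run_hook    # x1 reacts to x2's change
    run_hook    # x2 reacts to x1's change
done
echo "relation value after 10 rounds: $(cat "$BAG")"   # prints 20
rm -f "$BAG"
```

Nothing in the charm ever converges, so the hook rate is bounded only by the watcher polling interval and per-machine hook scheduling.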

Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
tags: added: rate-limit
tags: added: performance
Revision history for this message
John A Meinel (jameinel) wrote :

I haven't dug into this deeply, but I'm guessing the problem is something like "what is the IP address for one end of the relation", which ends up calling Instance.DNSName, and that involves an API call to the EC2 provider.

Right now I'm debugging the case where every unit that starts up triggers two calls (one for the StateAddress and one for the APIAddress). That may or may not actually help here, but it is probably the place to start looking.

Revision history for this message
John A Meinel (jameinel) wrote :

I do see this line in worker/uniter/uniter.go's runHook():
apiAddrs, err := u.st.APIAddresses()

And in the current implementation, APIAddresses goes all the way to the provider-state file to create an Environment and does a provider API lookup to get the address. (Which is extra funny given that u.st is actually already an API connection.)

But yes, every hook execution will trigger at *least* one EC2 API call. We should fix the Addresses code to just extract the addresses from state (see bug #1205371).
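To put rough numbers on that: a back-of-envelope count of provider calls under this code path, assuming one EC2 call per hook run at the polling cadence from the report (the unit count and call count here are illustrative assumptions, not measurements):

```shell
#!/bin/bash
# Back-of-envelope: provider calls generated per minute when every hook
# execution costs at least one EC2 API call. Figures are illustrative.
UNITS=12            # e.g. six x1 + six x2 units
PERIOD=5            # watcher polling interval in seconds, per the report
CALLS_PER_HOOK=1    # lower bound: one provider call per hook execution
FIRINGS_PER_MIN=$(( 60 / PERIOD ))
echo "provider calls/minute: $(( UNITS * FIRINGS_PER_MIN * CALLS_PER_HOOK ))"   # prints 144
```

Even this modest deployment sustains well over a hundred provider calls a minute, which is how a busy charm can exhaust an account's EC2 rate limit on its own.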

Revision history for this message
John A Meinel (jameinel) wrote :

With the recent changes from Roger and Martin to cache the APIAddress content in the database, and my change to make the Uniter use the cached content, this bug *should* be fixed. I still need to actually test it, but with today's trunk we shouldn't see as many calls for Environ information just to run hooks.

Changed in juju-core:
assignee: nobody → John A Meinel (jameinel)
milestone: none → 1.17.0
status: Triaged → In Progress
Revision history for this message
John A Meinel (jameinel) wrote :

Note the related bug was bug #1205371

Revision history for this message
Dave Cheney (dave-cheney) wrote : Re: [Bug 1244766] Re: uniter: high frequency relation operations cause rate limits to be exceeded

You can use the sample charm in the issue description.


Revision history for this message
John A Meinel (jameinel) wrote :

So I just tested bootstrapping with trunk @ rev 2034 which is just before my Uniter address cache patch. If you tail the /var/log/juju/machine-0.log file you can see lines like:
  2013-11-07 07:30:43 DEBUG juju state.go:160 waiting for DNS name(s) of state server instances [i-1ffaea7b]

Which are triggered twice every 5s (when you are running one x1 and one x2 unit).

After running "juju upgrade-juju" with trunk @ 2036 (my patch was 2035), those lines are no longer in the log.

I then turned up the number of units (I deployed all units to the same machine so I wouldn't have to worry about extra cost), and things kept up cheerily. (And I could run 'juju status', which *does* currently use the EC2 API to find out instance status.)

Certainly the log file ends up with lots of churn as these hooks fire over and over, and it puts a fair amount of load on an m1.small (load average 2.5), but I was happily running away with six x1 and six x2 units.

I was able to get up to 16 units of each, though I think at that point the limiting factor was that only one hook runs per machine at any given time.
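That per-machine serialization caps hook throughput by machine count rather than unit count. A rough sketch of the ceiling (the hook runtime here is an assumed figure, not a measurement from this test):

```shell
#!/bin/bash
# With hooks serialized per machine, the whole machine shares one hook slot.
# HOOK_RUNTIME_MS is an illustrative assumption.
UNITS=32                 # 16 x1 + 16 x2, all co-located on one machine
HOOK_RUNTIME_MS=250      # assumed average duration of one hook execution
echo "machine-wide hook runs/sec: $(( 1000 / HOOK_RUNTIME_MS ))"   # prints 4
# Shared across 32 units, each unit's hooks fire well below the 5s watcher
# cadence, so the queue -- not the watcher -- becomes the bottleneck.
```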

Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
John A Meinel (jameinel) wrote :

Anyway: lots of hooks running with no calls to the EC2 API means we at least don't DoS your account credentials when hooks get really busy.
(I saw this when doing scale testing, with 1000 units all trying to run just the started hook.)

Revision history for this message
Dave Cheney (dave-cheney) wrote :

Awesome win!

DoSing your account credentials is bad, because you can't use the AWS
console (it uses your API keys and limits) or juju ssh (as it needs to
talk to storage).

If you can get to the bootstrap node, killing the API server will
stall everything and recover your ability to get to the AWS console.


Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released