Canonical Juju

Juju clients need to randomise controller IPs and backoff appropriately

Bug #1793245 reported by Haw Loeung on 2018-09-19

This bug affects 3 people

Affects	Status	Importance	Assigned to	Milestone
Canonical Juju	Fix Released	High	John A Meinel	Canonical Juju 2.5-beta1
2.4	Fix Released	High	John A Meinel	Canonical Juju 2.4.6

Bug Description

Hi,

This has happened a couple of times now. I know that there's an open bug for lots of MongoDB connections (LP:1786258) but not sure if there's one for agents DoS'ing the controller.

Today, we've had the Juju machine agent OOM-killed on one of the controllers (ubuntu/1). This caused a storm where a bunch of agents were all connecting to a single controller (which also happens to be the juju-db/MongoDB primary, so this may be related in that it's busier than the others).

Controllers are:

| ubuntu/0 10.25.2.109
| ubuntu/1 10.25.2.111
| ubuntu/2 10.25.2.110

The one that was getting hammered was ubuntu/1. The work-around was to firewall off client connections on ubuntu/1.

I'm not sure what magic happens or how clients pick which controller to talk to. Do they try to connect to all 3 and complete the handshake with the one that responds first? Or do they pick from the given list, hitting the first entry?

For the Juju 2 controller environment, .local/share/juju/controllers.yaml looks like this:

| api-endpoints: ['10.25.2.111:17070', '10.25.2.110:17070', '10.25.2.109:17070']

Haw Loeung (hloeung) wrote on 2018-09-19:

#1

Oh yeah, the other time we've seen this happen was when upgrading the controllers from 2.3.4 to 2.4.3.

description:	updated

Haw Loeung (hloeung) on 2018-09-19

summary:

- Juju agents DoS'ing the server
+ Juju agents DoS'ing the controller
description:	updated

Richard Harding (rharding) on 2018-09-20

Changed in juju:
status:	New → Triaged
importance:	Undecided → High
milestone:	none → 2.5-beta1

Joel Sing (jsing) wrote on 2018-09-21: Re: Juju agents DoS'ing the controller

#2

From a quick glance at the code (api/apiclient.go), a Juju client attempts to connect to the given list of controllers, somewhat in parallel, with a 50ms gap between dialling each controller. There are several issues here:

- The order of the controller IPs appears to be the same across all clients. Since the code does not randomise the address order, the first controller will always be dialled first.

- A gap of only 50ms means that when the controllers are heavily loaded, all controllers will be dialled and will attempt to respond. If they cannot respond in time, the client will time out (likely leaving server resources in use) and try again, making the situation worse.

- I could not see any evidence of exponential backoff (unless it is at a separate layer), which means that in a controller load/failure situation the controllers are unlikely to recover on their own (we've had multiple cases where firewalling off external clients has been the only way to resolve controller problems).

There are several things that should be considered to make this more robust and facilitate recovery:

- The list of controller addresses should be randomised by the client (e.g. a Fisher-Yates shuffle or similar) prior to each dial attempt. That way the order is indeterminate and the load will be spread more evenly.

- The gap between dialling each controller should be a lot more than 50ms (considering the time for a TCP connection, WebSocket overhead, authentication, etc - I suspect 500ms or even a second would not be unreasonable as a base value).

- Clients should back off exponentially - if we've tried dialling each controller and failed to establish a connection, then we should double the delay before trying again, up to some hard limit (and possibly double the gap between dials as well).

Another thing to mention here - in an EMFILE condition (too many open files, which we've seen multiple times in production), a controller will fail to accept connections from itself, which leads to further degradation and failures. It would be worth working out a way to avoid this problem (for example, running a separate API listener on a different port for controller use, or reserving a pool of connections for the controller and more aggressively accepting/closing client connections).

Haw Loeung (hloeung) wrote on 2018-10-15:

#3

Related discussion - https://discourse.jujucharms.com/t/stable-controller-startup-under-heavy-agent-load/296

Joel Sing (jsing) on 2018-10-23

summary:

- Juju agents DoS'ing the controller
+ Juju clients need to randomise controller IPs and backoff appropriately

Joel Sing (jsing) wrote on 2018-10-23:

#4

I'm retargeting this to the bad client behaviour (lack of randomisation, aggressive dialling, lack of exponential backoff on failure, etc). I've raised separate bugs for the controller-side issues:

https://bugs.launchpad.net/juju/+bug/1799360
https://bugs.launchpad.net/juju/+bug/1799363
https://bugs.launchpad.net/juju/+bug/1799365

John A Meinel (jameinel) wrote on 2018-10-24:

#5

https://github.com/juju/juju/pull/9360 addresses Agents randomizing their connections and exponential backoff.

Changed in juju:
assignee:	nobody → John A Meinel (jameinel)
status:	Triaged → In Progress

Tim Penhey (thumper) wrote on 2018-11-13:

#6

This became a bit of a pie-in-the-face moment due to bug 1802824.

John A Meinel (jameinel) on 2018-11-13

Changed in juju:
status:	In Progress → Fix Committed

Anastasia (anastasia-macmood) on 2019-03-22

Changed in juju:
status:	Fix Committed → Fix Released
