Juju clients need to randomise controller IPs and backoff appropriately

Bug #1793245 reported by Haw Loeung on 2018-09-19
This bug affects 3 people
Affects    Importance    Assigned to
juju       High          John A Meinel
2.4        High          John A Meinel

Bug Description

Hi,

This has happened a couple of times now. I know that there's an open bug for lots of MongoDB connections (LP:1786258), but I'm not sure if there's one for agents DoS'ing the controller.

Today, the Juju machine agent was OOM-killed on one of the controllers (ubuntu/1). This caused a storm where a bunch of agents were all connecting to a single controller (which also happens to be the juju-db/MongoDB primary, so this may be related in that it's busier than the others).

Controllers are:

| ubuntu/0 10.25.2.109
| ubuntu/1 10.25.2.111
| ubuntu/2 10.25.2.110

The one that was getting hammered was ubuntu/1. The work-around was to firewall off client connections on ubuntu/1.

I'm not sure what magic happens or how a client picks which controller it should talk to. Does it try to connect to all 3 and complete the handshake with the one that responds first? Or does it pick from the given list, hitting the first entry first?

For the Juju 2 controller environment, .local/share/juju/controllers.yaml looks like this:

| api-endpoints: ['10.25.2.111:17070', '10.25.2.110:17070', '10.25.2.109:17070']

Haw Loeung (hloeung) wrote :

Oh yeah, the other time we've seen this happen was when upgrading the controllers from 2.3.4 to 2.4.3.

description: updated
Haw Loeung (hloeung) on 2018-09-19
summary: - Juju agents DoS'ing the server
+ Juju agents DoS'ing the controller
description: updated
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.5-beta1

From a quick glance at the code (api/apiclient.go), a Juju client attempts to connect to the given list of controllers, somewhat in parallel, with a 50ms gap between dialling each controller. There are various issues here:

 - The order of the controller IPs appears to be the same across all clients. Since the code does not randomise the address order, the first controller will always be dialled first.

 - A gap of only 50ms means that when the controllers are more heavily loaded, all controllers will be dialled and attempt to respond. If they are not able to respond in time, the client will time out (likely leaving server resources in use) and try again, making the situation worse.

 - I could not see any evidence of exponential backoff (unless it is at a separate layer), which means that in the controller load/failure situation the controllers are unlikely to recover on their own (we've had multiple cases where firewalling off external clients has been the only way to resolve controller problems).

There are several things that should be considered to make this more robust and facilitate recovery:

 - The list of controller addresses should be randomised by the client (e.g. a Fisher-Yates shuffle or similar) prior to each dial attempt. That way the order is indeterminate and the load will be spread more evenly.

 - The gap between dialling each controller should be a lot more than 50ms (considering the time for a TCP connection, WebSocket overhead, authentication, etc. I suspect 500ms or even a second would not be unreasonable as a base value).

 - Clients should back off exponentially: if we've tried dialling each controller and failed to establish a connection, then we should double the delay before trying again, up to some hard limit (and possibly double the gap between dialling as well).
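
As a rough illustration of the suggestions above, here is a minimal Go sketch. It is not the code in api/apiclient.go or in any Juju change; connect, dialFunc, shuffled and nextDelay are hypothetical names, the dial step is a stand-in, the delay values are examples only, and controllers are tried one after another (rather than staggered in parallel) to keep the sketch short.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// dialFunc is a hypothetical stand-in for whatever opens an API connection.
type dialFunc func(addr string) error

// shuffled returns a copy of addrs in random order. rand.Shuffle does the
// shuffling; on Go versions before 1.20 the global source should be seeded.
func shuffled(addrs []string) []string {
	out := append([]string(nil), addrs...)
	rand.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

// nextDelay doubles d, capped at max (the hard limit).
func nextDelay(d, max time.Duration) time.Duration {
	d *= 2
	if d > max || d <= 0 { // d <= 0 guards against overflow
		return max
	}
	return d
}

// connect tries every address in a fresh random order each round. After a
// round in which every dial failed it sleeps, then doubles the delay.
func connect(addrs []string, dial dialFunc, base, max time.Duration,
	rounds int, sleep func(time.Duration)) (string, error) {
	delay := base
	for r := 0; r < rounds; r++ {
		for _, addr := range shuffled(addrs) {
			if err := dial(addr); err == nil {
				return addr, nil
			}
		}
		if r < rounds-1 {
			sleep(delay)
			delay = nextDelay(delay, max)
		}
	}
	return "", errors.New("all controllers unreachable")
}

func main() {
	addrs := []string{"10.25.2.111:17070", "10.25.2.110:17070", "10.25.2.109:17070"}
	attempts := 0
	// Fake dialer: the first four attempts fail, the fifth succeeds.
	dial := func(addr string) error {
		attempts++
		if attempts < 5 {
			return errors.New("connection refused")
		}
		return nil
	}
	// Print instead of sleeping so the example finishes immediately.
	sleep := func(d time.Duration) { fmt.Println("backing off for", d) }
	addr, err := connect(addrs, dial, time.Second, 30*time.Second, 5, sleep)
	fmt.Println("connected to:", addr, "err:", err)
}
```

Passing the sleep function in keeps the backoff logic easy to exercise without real waits; a real client would pass time.Sleep (or something cancellable).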

Another thing to mention here - in an EMFILE condition (which we've seen multiple times in production), a controller will fail to accept connections from itself, which leads to further degradation and failures - it would be worth working out a way to avoid this problem (for example, running a separate API listener on a different port for controller use, or reserving a pool of connections for the controller and more aggressively accepting/closing client connections).
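
To make the "reserved pool" idea a bit more concrete, here is a minimal Go sketch of an admission counter that keeps some connection slots back for the controller itself. It is only an assumption about how such a thing could look, not anything that exists in Juju: the slots type and its acquire/release methods are hypothetical, and how it would be wired into the API listener (and how "internal" connections would be identified) is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// slots is a hypothetical admission counter. External clients may use at most
// total-reserved slots; internal (controller) connections may use all of them.
type slots struct {
	mu       sync.Mutex
	total    int
	reserved int
	used     int
}

// acquire reports whether a new connection may be admitted.
func (s *slots) acquire(internal bool) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	limit := s.total - s.reserved
	if internal {
		limit = s.total
	}
	if s.used >= limit {
		return false
	}
	s.used++
	return true
}

// release frees one slot when a connection closes.
func (s *slots) release() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.used > 0 {
		s.used--
	}
}

func main() {
	s := &slots{total: 3, reserved: 1}
	fmt.Println(s.acquire(false)) // client admitted
	fmt.Println(s.acquire(false)) // client admitted
	fmt.Println(s.acquire(false)) // client rejected: only the reserved slot is left
	fmt.Println(s.acquire(true))  // controller admitted into the reserved slot
}
```

With total set comfortably below the process file descriptor limit, rejected clients would be closed straight away rather than pushing the process into EMFILE.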

Joel Sing (jsing) on 2018-10-23
summary: - Juju agents DoS'ing the controller
+ Juju clients need to randomise controller IPs and backoff appropriately
Joel Sing (jsing) wrote :

I'm retargeting this to the bad client behaviour (lack of randomisation, aggressive dialling, lack of exponential backoff on failure, etc). I've raised separate bugs for the controller-side issues:

https://bugs.launchpad.net/juju/+bug/1799360
https://bugs.launchpad.net/juju/+bug/1799363
https://bugs.launchpad.net/juju/+bug/1799365

John A Meinel (jameinel) wrote :

https://github.com/juju/juju/pull/9360 addresses agents randomizing their connections and exponential backoff.

Changed in juju:
assignee: nobody → John A Meinel (jameinel)
status: Triaged → In Progress
Tim Penhey (thumper) wrote :

Became a bit of a pie in the face moment due to bug 1802824.

John A Meinel (jameinel) on 2018-11-13
Changed in juju:
status: In Progress → Fix Committed