Comment 2 for bug 1793245

Joel Sing (jsing) wrote : Re: Juju agents DoS'ing the controller

From a quick glance at the code (api/apiclient.go), a Juju client is attempting to connect to the given list of controllers, somewhat in parallel, with a 50ms gap between dialing each controller. There are various issues here:

 - The order of the controller IPs appears to be the same across all clients. Since the code does not randomise the address order, every client dials the first controller first.

 - A gap of only 50ms means that when the controllers are more heavily loaded, all controllers will be dialled and will attempt to respond. If they are not able to respond in time, the client will time out (likely leaving server resources in use) and try again, making the situation worse.

 - I could not see any evidence of exponential backoff (unless it is at a separate layer), which means that in the controller load/failure situation the controllers are unlikely to recover on their own (we've had multiple cases where firewalling off external clients has been the only way to resolve controller problems).

There are several things that should be considered to make this more robust and facilitate recovery:

 - The list of controller addresses should be randomised by the client (e.g. a Fisher-Yates shuffle or similar) prior to each dial attempt. That way the order differs between clients and between attempts, and the load will be spread more evenly.

 - The gap between dialling each controller should be a lot more than 50ms (considering the time for a TCP connection, WebSocket overhead, authentication, etc.). I suspect 500ms or even a second would not be unreasonable as a base value.

 - Clients should back off exponentially - if we've tried dialing each controller and failed to establish a connection, then we should double the delay before trying again, up to some hard limit (and possibly double the gap between dialling as well).

Another thing to mention here - in an EMFILE condition (the process has run out of file descriptors, which we've seen multiple times in production), a controller will fail to accept connections from itself, which leads to further degradation and failures. It would be worth working out a way to avoid this problem (for example, running a separate API listener on a different port for controller use, or reserving a pool of connections for the controller and more aggressively accepting/closing client connections).