Juju controllers should hard limit client connections

Bug #1799360 reported by Joel Sing on 2018-10-23
This bug affects 3 people
Affects: juju
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

In our production environment, when an HA Juju controller starts up we see memory usage (RSS) of around 2GB. Memory usage then grows with the number of clients connected to the controller - roughly 0.8MB per client, based on rough observations. With 11,000 client connections this means the controller jujud uses around 11GB of memory. Add in a 4GB mongod, some additional jujuds for subordinates, and other operational processes, and a host with 16GB runs out of memory, resulting in the controller jujud being OOM killed (and once killed, it rarely recovers by itself due to the client DDoS issue - LP: #1793245).

Each controller should have a hard upper bound on the number of client connections it will accept (as far as I'm aware no such limit currently exists). This limit probably needs to be determined dynamically from the system resources available, e.g. (host memory - 5GB for mongod et al - 2GB for the base jujud) / 1MB per connection, which works out to roughly 1,000 connections for an 8GB host, 9,000 for a 16GB host, and so on. Bottom line: it is far better to reject connections and have clients connect to another controller (or fail to connect entirely) than to push a jujud to the point of being OOM killed.
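
As a rough illustration of the kind of limit proposed above, the sketch below derives a connection cap from host memory using the same reserve figures and wraps a Go net.Listener so that connections beyond the cap are closed immediately, letting the client fail over to another controller. The names (maxClientConns, newLimitListener) and constants are illustrative assumptions, not Juju's actual code; golang.org/x/net/netutil.LimitListener offers similar behaviour, though it blocks rather than rejects.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// maxClientConns derives a hard connection cap from host memory using the
// rough budget above: ~5GB reserved for mongod et al, ~2GB for the base
// jujud, and ~1MB per client connection.
func maxClientConns(hostMemBytes uint64) int {
	const (
		reservedMongo uint64 = 5 << 30 // mongod and other processes
		reservedJujud uint64 = 2 << 30 // base jujud footprint
		perClient     uint64 = 1 << 20 // observed ~1MB per connection
	)
	if hostMemBytes <= reservedMongo+reservedJujud {
		return 0
	}
	return int((hostMemBytes - reservedMongo - reservedJujud) / perClient)
}

// limitListener caps concurrent connections: anything beyond the cap is
// closed straight away instead of pushing the process towards an OOM kill.
type limitListener struct {
	net.Listener
	slots chan struct{}
}

func newLimitListener(l net.Listener, n int) net.Listener {
	return &limitListener{Listener: l, slots: make(chan struct{}, n)}
}

func (l *limitListener) Accept() (net.Conn, error) {
	for {
		c, err := l.Listener.Accept()
		if err != nil {
			return nil, err
		}
		select {
		case l.slots <- struct{}{}: // a slot is free: hand the conn on
			return &countedConn{Conn: c, release: func() { <-l.slots }}, nil
		default: // hard limit reached: reject so the client tries elsewhere
			c.Close()
		}
	}
}

// countedConn returns its slot when closed.
type countedConn struct {
	net.Conn
	release func()
	once    sync.Once
}

func (c *countedConn) Close() error {
	c.once.Do(c.release) // free the slot exactly once
	return c.Conn.Close()
}

func main() {
	// For a 16GB controller host this works out to roughly 9,000 connections.
	limit := maxClientConns(16 << 30)
	fmt.Println("connection limit:", limit)

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	_ = newLimitListener(ln, limit) // wrap the API server's listener with the cap
}
```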

Haw Loeung (hloeung) on 2018-10-23
description: updated
Changed in juju:
status: New → Triaged
importance: Undecided → High
Paul Gear (paulgear) on 2018-10-24
tags: added: canonical-is
Tim Penhey (thumper) wrote :

I have a feeling that this could be more than just connection load. Is it possible to gather a few heap reports for a loaded controller over a few hours?
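
For context, one generic way to gather such reports from a long-running Go process - independent of jujud's own introspection tooling - is to snapshot heap profiles periodically with the standard runtime/pprof package and compare them afterwards. A minimal sketch (the helper name and schedule are illustrative):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime/pprof"
	"time"
)

// dumpHeapProfiles writes a heap profile every `every` interval, `count`
// times, so growth over a few hours can be compared snapshot to snapshot.
func dumpHeapProfiles(dir string, every time.Duration, count int) error {
	for i := 0; i < count; i++ {
		f, err := os.Create(filepath.Join(dir, fmt.Sprintf("heap-%d.pprof", i)))
		if err != nil {
			return err
		}
		if err := pprof.WriteHeapProfile(f); err != nil {
			f.Close()
			return err
		}
		f.Close()
		time.Sleep(every)
	}
	return nil
}

func main() {
	// e.g. one snapshot per hour for four hours
	if err := dumpHeapProfiles(os.TempDir(), time.Hour, 4); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Snapshots taken hours apart can then be diffed with, for example, `go tool pprof -base heap-0.pprof heap-3.pprof`.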

Alexandre Gomes (alejdg) wrote :

@thumper here are the reports.

Stuart Bishop (stub) wrote :

This bug is specifically about putting defenses in place so that whole classes of bugs and attacks don't take down the controllers, rather than diagnosing a particular trigger.

Tim Penhey (thumper) wrote :

@stub I agree, and we are putting things into place, but I'm also just checking that we aren't leaking.
