Activity log for bug #1799360

Date Who What changed Old value New value Message
2018-10-23 07:11:25 Joel Sing bug added bug
2018-10-23 08:22:32 Haw Loeung bug added subscriber The Canonical Sysadmins
2018-10-23 08:22:45 Haw Loeung description
Old value: In our production environment, when a HA Juju controller starts up we see memory usage (RSS) of around 2GB. Memory usage continues to grow based on the number of clients connected to the controller - around 0.8MB per client, based on rough observations. With 11,000 client connections this means that the controller jujud is using around 11GB of memory. Add in a 4GB mongod, some additional jujuds for subordinates and other operational processes, and a host with 16GB is run out of memory, resulting in the jujud for the controller being OOM killed (and once killed, they rarely recover by themselves due to the client DDoS issue - lp#1793245). Each controller should have a hard upper bound on the number of client connections it will accept (as far as I'm aware this does not currently exist) - this limit probably needs to be dynamically determined based on the amount of system resources available (e.g. (host memory - 5GB (for mongod et al) - 2GB (for base jujud)) / 1MB =~ 1,000 for an 8GB host; 9,000 for a 16GB host, etc). Bottom line - it is far better to reject connections and have clients connect to another controller (or fail to connect entirely), than push a jujud to the point of being OOM killed.
New value: In our production environment, when a HA Juju controller starts up we see memory usage (RSS) of around 2GB. Memory usage continues to grow based on the number of clients connected to the controller - around 0.8MB per client, based on rough observations. With 11,000 client connections this means that the controller jujud is using around 11GB of memory. Add in a 4GB mongod, some additional jujuds for subordinates and other operational processes, and a host with 16GB is run out of memory, resulting in the jujud for the controller being OOM killed (and once killed, they rarely recover by themselves due to the client DDoS issue - LP: #1793245).
Each controller should have a hard upper bound on the number of client connections it will accept (as far as I'm aware this does not currently exist) - this limit probably needs to be dynamically determined based on the amount of system resources available (e.g. (host memory - 5GB (for mongod et al) - 2GB (for base jujud)) / 1MB =~ 1,000 for an 8GB host; 9,000 for a 16GB host, etc). Bottom line - it is far better to reject connections and have clients connect to another controller (or fail to connect entirely), than push a jujud to the point of being OOM killed.
2018-10-23 08:24:15 Haw Loeung bug added subscriber Haw Loeung
2018-10-23 15:46:55 Richard Harding juju: status New Triaged
2018-10-23 15:46:57 Richard Harding juju: importance Undecided High
2018-10-24 01:36:55 Paul Gear tags canonical-is
2018-10-25 21:04:19 Alexandre Gomes attachment added heap_reports.tgz https://bugs.launchpad.net/juju/+bug/1799360/+attachment/5205645/+files/heap_reports.tgz
2022-11-03 16:43:14 Canonical Juju QA Bot juju: importance High Low
2022-11-03 16:43:16 Canonical Juju QA Bot tags canonical-is canonical-is expirebugs-bot
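
The sizing arithmetic in the bug description can be sketched as follows. This is only an illustration of the reporter's rough formula, not Juju code: the function names and parameters are hypothetical, the 5GB and 2GB reservations and the 1MB per-client figure come from the description, and 1GB is treated as 1024MB (an assumption, which is why the results land slightly above the rounded 1,000 and 9,000 quoted there).

```python
MB_PER_GB = 1024  # assumption: binary units; the description rounds loosely


def max_client_connections(host_memory_gb, reserved_gb=5, base_jujud_gb=2,
                           per_client_mb=1.0):
    """Upper bound on client connections using the description's formula:
    (host memory - reserved for mongod et al - base jujud) / per-client cost.
    """
    available_mb = (host_memory_gb - reserved_gb - base_jujud_gb) * MB_PER_GB
    if available_mb <= 0:
        # Host is too small to leave any headroom for client connections.
        return 0
    return int(available_mb // per_client_mb)


def estimated_jujud_rss_gb(clients, base_gb=2, per_client_mb=0.8):
    """Rough controller RSS from the observed figures in the description:
    about 2GB at startup plus about 0.8MB per connected client.
    """
    return base_gb + clients * per_client_mb / MB_PER_GB


if __name__ == "__main__":
    for host_gb in (8, 16):
        print(host_gb, "GB host ->", max_client_connections(host_gb), "clients")
    print("11,000 clients ->", round(estimated_jujud_rss_gb(11000), 1), "GB RSS")
```

The limit uses 1MB per client rather than the observed 0.8MB, which leaves some margin; a real implementation would also need to decide how to reject or redirect connections once the bound is reached, which this sketch does not address.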