Activity log for bug #1799365

Date Who What changed Old value New value Message
2018-10-23 07:44:09 Joel Sing bug added bug
2018-10-23 08:23:42 Haw Loeung bug added subscriber The Canonical Sysadmins
2018-10-23 08:24:06 Haw Loeung description

Old value:

Once a Juju client connects to a Juju HA controller, it will remain connected to that controller until either the client or the controller is restarted. Due to current bad client behaviour (lp#1793245), the clients will target a single controller and that controller will continue to gain clients, while the other controllers will not - this client behaviour, coupled with the fact that there appears to be no hard limit on the number of client connections a controller will accept, leads to Juju controllers being OOM killed (lp#1799360). This in turn leads to stability issues due to the bad client behaviour combined with the controller-to-self and controller-to-controller communication often failing (lp#1799363).

Juju HA controllers should actively work to distribute client connections - some of the options to do this include:

- Randomly failing client connections (e.g. reject one in three connections on the basis that the clients will try another controller and/or retry).
- Communicate the number client connections between controllers and disconnect clients when a controller has X (say 500-1000) more clients than the others in stable state. This requires clients to be better behaved (lp#1793245 needs to be fixed first), so that they are likely to reconnect to another controller.
- Provide a "redirect to controller X" in the API - this is similar to the above, but allows clients to be specifically directed to another controller that is known to be healthy and less loaded.
- Front the Juju HA controllers with some form of load balancer that actively distributes incoming client connections to the jujud API servers.

It is worth noting that part of the connection distribution problem can simply be avoided by having clients that randomise the controller IP list. This does not however address situations that are caused by bringing one controller down and back up again (it will still have ~0 connections until clients are disconnected from other controllers over some time period).

New value:

Once a Juju client connects to a Juju HA controller, it will remain connected to that controller until either the client or the controller is restarted. Due to current bad client behaviour (LP: #1793245), the clients will target a single controller and that controller will continue to gain clients, while the other controllers will not - this client behaviour, coupled with the fact that there appears to be no hard limit on the number of client connections a controller will accept, leads to Juju controllers being OOM killed (LP: #1799360). This in turn leads to stability issues due to the bad client behaviour combined with the controller-to-self and controller-to-controller communication often failing (LP: #1799363).

Juju HA controllers should actively work to distribute client connections - some of the options to do this include:

- Randomly failing client connections (e.g. reject one in three connections on the basis that the clients will try another controller and/or retry).
- Communicate the number client connections between controllers and disconnect clients when a controller has X (say 500-1000) more clients than the others in stable state. This requires clients to be better behaved (LP: #1793245 needs to be fixed first), so that they are likely to reconnect to another controller.
- Provide a "redirect to controller X" in the API - this is similar to the above, but allows clients to be specifically directed to another controller that is known to be healthy and less loaded.
- Front the Juju HA controllers with some form of load balancer that actively distributes incoming client connections to the jujud API servers.

It is worth noting that part of the connection distribution problem can simply be avoided by having clients that randomise the controller IP list. This does not however address situations that are caused by bringing one controller down and back up again (it will still have ~0 connections until clients are disconnected from other controllers over some time period).

2018-10-23 08:24:37 Haw Loeung bug added subscriber Haw Loeung
2018-10-23 19:29:48 Tim Penhey juju: status New Triaged
2018-10-23 19:29:51 Tim Penhey juju: importance Undecided High
2018-10-23 19:30:28 Tim Penhey tags performance scalability
2018-10-24 01:36:24 Paul Gear tags performance scalability canonical-is performance scalability
2018-10-24 03:26:55 John A Meinel juju: status Triaged In Progress
2018-10-24 03:26:55 John A Meinel juju: milestone 2.5-beta1
2018-10-24 03:26:55 John A Meinel juju: assignee John A Meinel (jameinel)
2018-10-24 03:27:02 John A Meinel nominated for series juju/2.4
2018-10-24 03:27:02 John A Meinel bug task added juju/2.4
2018-10-24 03:27:09 John A Meinel juju/2.4: status New In Progress
2018-10-24 03:27:11 John A Meinel juju/2.4: importance Undecided High
2018-10-24 03:27:15 John A Meinel juju/2.4: assignee John A Meinel (jameinel)
2018-10-24 03:27:18 John A Meinel juju/2.4: milestone 2.4.5
2018-11-06 21:34:19 Canonical Juju QA Bot juju/2.4: milestone 2.4.5 2.4.6
2018-11-13 13:39:20 John A Meinel juju/2.4: status In Progress Fix Committed
2018-11-13 13:39:21 John A Meinel juju: status In Progress Fix Committed
2018-11-15 02:16:50 Canonical Juju QA Bot juju/2.4: status Fix Committed Fix Released
2019-01-31 03:07:02 Joel Sing juju: status Fix Committed New
2019-01-31 03:07:05 Joel Sing juju/2.4: status Fix Released New
2019-02-01 02:35:54 Tim Penhey juju: milestone 2.5-beta1 2.5.2
2019-02-01 02:35:57 Tim Penhey juju: assignee John A Meinel (jameinel)
2019-02-01 02:36:04 Tim Penhey bug task deleted juju/2.4
2019-02-01 02:36:09 Tim Penhey juju: status New Triaged
2019-03-11 23:01:35 Canonical Juju QA Bot juju: milestone 2.5.2 2.5.3
2019-03-26 02:08:56 Canonical Juju QA Bot juju: milestone 2.5.3 2.5.4
2019-04-02 01:46:17 Canonical Juju QA Bot juju: milestone 2.5.4 2.5.5
2019-05-14 06:17:19 Anastasia juju: milestone 2.5.6 2.5.8
2019-06-28 02:03:24 Canonical Juju QA Bot juju: milestone 2.5.8 2.5.9
2019-10-31 03:37:13 Anastasia juju: milestone 2.5.9
2019-10-31 03:37:14 Tim Penhey summary Juju HA controllers need to distribute client connections Juju HA controllers need to actively distribute client connections
2019-10-31 03:37:20 Tim Penhey juju: importance High Medium
2022-11-03 16:54:58 Canonical Juju QA Bot juju: importance Medium Low
2022-11-03 16:54:59 Canonical Juju QA Bot tags canonical-is performance scalability canonical-is expirebugs-bot performance scalability
2023-01-25 14:28:54 Junien F juju: importance Low Medium
2023-01-26 12:51:43 Juan M. Tirado juju: milestone 3.2-backlog
2023-02-03 10:02:39 Vitaly Antonenko juju: milestone 3.2-backlog