Juju HA controllers need to distribute client connections

Bug #1799365 reported by Joel Sing on 2018-10-23
This bug affects 3 people

Bug Description

Once a Juju client connects to a Juju HA controller, it remains connected to that controller until either the client or the controller is restarted. Due to current bad client behaviour (LP: #1793245), clients target a single controller, which continues to gain clients while the other controllers do not. This client behaviour, coupled with the apparent absence of a hard limit on the number of client connections a controller will accept, leads to Juju controllers being OOM killed (LP: #1799360). This in turn leads to stability issues, because the bad client behaviour is combined with controller-to-self and controller-to-controller communication often failing (LP: #1799363).

Juju HA controllers should actively work to distribute client connections - some of the options to do this include:

- Randomly failing client connections (e.g. reject one in three connections on the basis that the clients will try another controller and/or retry).

- Communicate the number of client connections between controllers and disconnect clients when a controller has X (say 500-1000) more clients than the others in stable state. This requires clients to be better behaved (LP: #1793245 needs to be fixed first), so that they are likely to reconnect to another controller.

- Provide a "redirect to controller X" in the API - this is similar to the above, but allows clients to be specifically directed to another controller that is known to be healthy and less loaded.

- Front the Juju HA controllers with some form of load balancer that actively distributes incoming client connections to the jujud API servers.

It is worth noting that part of the connection distribution problem can simply be avoided by having clients randomise the controller IP list. This does not, however, address situations caused by bringing one controller down and back up again (it will still have ~0 connections until clients are disconnected from other controllers over some time period).

Haw Loeung (hloeung) on 2018-10-23
description: updated
Tim Penhey (thumper) on 2018-10-23
Changed in juju:
status: New → Triaged
importance: Undecided → High
tags: added: performance scalability
Paul Gear (paulgear) on 2018-10-24
tags: added: canonical-is
John A Meinel (jameinel) wrote :

https://github.com/juju/juju/pull/9360 targets 2.4 with a couple changes that should help connections.

Changing from fast retries to exponential backoff, and rand.Shuffle of the controller addresses for Agents.

Changed in juju:
assignee: nobody → John A Meinel (jameinel)
milestone: none → 2.5-beta1
status: Triaged → In Progress
John A Meinel (jameinel) on 2018-11-13
Changed in juju:
status: In Progress → Fix Committed
Joel Sing (jsing) wrote :

Reopening - this is still an ongoing issue - if for some reason all clients end up connected to a single controller, then they will remain there until something causes the connections to drop (e.g. forcing agent restarts, firewalling off agent connections, bouncing the controller, etc).

Please note that this bug is specifically about *active* load balancing on the controller side. The fix referenced is about improving the client side behaviour, which was detailed in https://bugs.launchpad.net/juju/+bug/1793245.

Changed in juju:
status: Fix Committed → New
Tim Penhey (thumper) wrote :

Retargeting for 2.5.2 for some form of active load balancing.

Changed in juju:
milestone: 2.5-beta1 → 2.5.2
assignee: John A Meinel (jameinel) → nobody
no longer affects: juju/2.4
Changed in juju:
status: New → Triaged
Changed in juju:
milestone: 2.5.2 → 2.5.3
Changed in juju:
milestone: 2.5.3 → 2.5.4
Changed in juju:
milestone: 2.5.4 → 2.5.5
Changed in juju:
milestone: 2.5.6 → 2.5.8
Changed in juju:
milestone: 2.5.8 → 2.5.9