juju controllers should do more to control startup load

Bug #1696113 reported by John A Meinel
Affects: Canonical Juju
Status: Expired
Importance: High
Assigned to: Unassigned

Bug Description

When restarting a controller with many agents connected to it (N agents > 2000), it often struggles to come back up into a stable state. We currently limit the number of concurrent login attempts, but we don't do much to encourage load balancing across the other controllers (when running in HA).

There are a few points we could try to address:

1) We currently bounce agents at the point where we've successfully negotiated a TLS session and have read their Login request. That means we've already done a fair bit of CPU work to negotiate the session key. It would be better to reject the connection before TLS negotiation even starts.
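
A very rough sketch of rejecting before the handshake (this is not Juju's actual listener code; the types and the limit below are invented for illustration): wrap the raw TCP listener so that excess connections are closed before tls.NewListener ever sees them.

    package main

    import (
        "crypto/tls"
        "net"
        "sync"
    )

    // limitedListener closes excess connections before any TLS work happens.
    type limitedListener struct {
        net.Listener
        sem chan struct{} // counting semaphore of in-flight connections
    }

    func (l *limitedListener) Accept() (net.Conn, error) {
        for {
            conn, err := l.Listener.Accept()
            if err != nil {
                return nil, err
            }
            select {
            case l.sem <- struct{}{}:
                // Capacity available: let this connection proceed to the
                // TLS handshake.
                return &trackedConn{Conn: conn, sem: l.sem}, nil
            default:
                // Over the limit: drop it now, before spending CPU on
                // session key negotiation.
                conn.Close()
            }
        }
    }

    // trackedConn releases its semaphore slot when the connection closes.
    type trackedConn struct {
        net.Conn
        sem  chan struct{}
        once sync.Once
    }

    func (c *trackedConn) Close() error {
        c.once.Do(func() { <-c.sem })
        return c.Conn.Close()
    }

    // listenLimited wires the limiter underneath tls.NewListener, so the
    // handshake only starts for connections that passed the check.
    func listenLimited(addr string, cfg *tls.Config, maxConns int) (net.Listener, error) {
        raw, err := net.Listen("tcp", addr)
        if err != nil {
            return nil, err
        }
        limited := &limitedListener{Listener: raw, sem: make(chan struct{}, maxConns)}
        return tls.NewListener(limited, cfg), nil
    }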

2) Other than existing load slowing the machine down, we don't bias connections toward machines that aren't the Mongo master, and there may even be some advantage to being on that machine (slightly faster access to the DB). However, at scale Mongo may ultimately want to consume most of that machine's CPU and RAM, so we may need to start biasing logins away from it long before the load is high enough to actually push them off naturally. One possibility is to introduce a sleep between socket accept and handling the connection; we could make that delay slightly larger on the Mongo master, scale it with the number of current connections, or both (sketched below).

This is more about "the load right now isn't a problem, but in 10 minutes it could be a much bigger problem", so we should be a bit more proactive about spreading connections.
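
A rough sketch of the accept-delay idea; isMongoPrimary and currentConns are assumed to come from the controller's own state, and the constants are invented.

    package main

    import "time"

    // acceptDelay sketches the "sleep between accept and handling" idea.
    func acceptDelay(isMongoPrimary bool, currentConns int) time.Duration {
        // Grow the delay as connections pile up, so a busy controller
        // becomes progressively less attractive to reconnecting agents.
        d := time.Duration(currentConns/100) * 10 * time.Millisecond
        if isMongoPrimary {
            // Bias agents away from the Mongo primary well before its
            // load would force them off naturally.
            d += 50 * time.Millisecond
        }
        return d
    }

    // In the accept loop, something like:
    //     conn, err := listener.Accept()
    //     if err == nil {
    //         time.Sleep(acceptDelay(isMongoPrimary(), connCount()))
    //         go handle(conn)
    //     }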

3) If things come up slowly, it is entirely plausible that all agents end up connected to a single controller, even though we have HA, because the load was never high enough to push them to the second node in the list.

We might want to change the logic for *agents* (probably not clients) to prioritize a randomly selected address, instead of prioritizing the address of the last successful connection.
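
A minimal sketch of what that agent-side change could look like, assuming addrs is the agent's cached list of controller API addresses:

    package main

    import "math/rand"

    // shuffledAddrs returns a copy of the controller API addresses in
    // random order, so reconnecting agents spread themselves across the
    // HA nodes instead of all picking the same one.
    func shuffledAddrs(addrs []string) []string {
        out := make([]string, len(addrs))
        copy(out, addrs)
        rand.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
        return out
    }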

4) We may also want to introduce exponential and/or randomized backoff, rather than a fixed wait, before retrying. We could also do some of this on the controller side (keep a small buffer of IP addresses to force to sleep, randomize the wait time, etc.).
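
A minimal sketch of exponential backoff with full jitter on the agent side (the base and cap values are invented):

    package main

    import (
        "math/rand"
        "time"
    )

    // retryDelay sketches exponential backoff with full jitter for agent
    // reconnect attempts.
    func retryDelay(attempt int) time.Duration {
        const (
            base     = time.Second
            maxDelay = 2 * time.Minute
        )
        d := base << uint(attempt) // 1s, 2s, 4s, 8s, ...
        if d <= 0 || d > maxDelay {
            d = maxDelay
        }
        // Full jitter: pick a random duration up to the ceiling so
        // thousands of agents don't retry in lockstep.
        return time.Duration(rand.Int63n(int64(d)))
    }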

5) The basic motivation was that on a really large model (~250 machines, ~2500 agents), when a given controller restarted it struggled to come back up to a working state without triggering another restart. Firewalling the API port allowed the other controllers to handle the load until things settled down.
It might also be that we need to be more aggressive during initialization about rejecting "remote" IP addresses connecting to port 17070 while we are getting set up. (Caveat: we do like it if you can run 'juju status' while the controller is starting up.) Maybe just make logins a lot slower (sleep 1s), and/or sleep even longer before dropping the connection when we can see it's an agent request rather than a user request.
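
A hand-wavy sketch of that throttling, assuming hypothetical stillStartingUp and isAgent inputs and invented sleep durations:

    package main

    import "time"

    // loginDelay sketches slowing logins down while the controller is
    // still initializing, without locking users out entirely.
    func loginDelay(stillStartingUp, isAgent bool) time.Duration {
        if !stillStartingUp {
            return 0
        }
        if isAgent {
            // Agents can wait (or be dropped after the sleep) until
            // initialization finishes.
            return 5 * time.Second
        }
        // Humans running 'juju status' still get in, just a bit more slowly.
        return time.Second
    }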

John A Meinel (jameinel)
description: updated
Alvaro Uria (aluria)
tags: added: 4010
Revision history for this message
Witold Krecicki (wpk) wrote :

We also don't wait for Mongo to recover to a stable state; we should monitor how Mongo behaves and throttle connections until it's stable.
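
A minimal sketch of what such a health check could gate on, shown here with the official mongo-go-driver for brevity (Juju itself talks to Mongo through mgo); the notion of "stable" below is deliberately simplified:

    package main

    import (
        "context"

        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/mongo"
    )

    // mongoIsStable reports whether this node's replica-set member state
    // looks healthy enough to stop throttling new agent connections.
    func mongoIsStable(ctx context.Context, client *mongo.Client) bool {
        res := client.Database("admin").RunCommand(ctx, bson.D{{Key: "replSetGetStatus", Value: 1}})
        var status struct {
            MyState int `bson:"myState"`
        }
        if err := res.Decode(&status); err != nil {
            return false
        }
        // 1 == PRIMARY, 2 == SECONDARY; anything else (STARTUP,
        // RECOVERING, ROLLBACK, ...) means keep throttling.
        return status.MyState == 1 || status.MyState == 2
    }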

Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 5 years, so we're marking it Expired. If you believe this is incorrect, please update the status.

Changed in juju:
status: Triaged → Expired
tags: added: expirebugs-bot