juju controllers should do more to control startup load
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Expired | High | Unassigned |
Bug Description
When restarting a controller with many agents connected to it (N > 2000 agents), it often struggles to come back up in a stable state. We currently limit the number of concurrent login attempts, but we do little to encourage load balancing to the other controllers (when in HA).
There are a few points we could try to address:
1) We currently bounce agents at the point where we've successfully negotiated a TLS session and have received their Login information. That means we've already done a fair bit of CPU work to negotiate the session key. It would be cheaper to reject the connection before TLS negotiation starts.
2) Other than existing load slowing down the machine, we don't bias connections away from the machine that is the Mongo master, and there may even be some advantage to being on that machine (slightly faster access to the DB). However, at scale Mongo may ultimately want to consume most of the machine's CPU + RAM, so we may need to start biasing logins away long before the load is sufficient to actually push them off naturally. One possibility is to introduce a sleep between socket.accept and handling the connection. We could make that delay slightly larger on the machine hosting the Mongo master, or vary the delay based on the number of current connections (or both).
This is more about "the load right now isn't a problem, but in 10 minutes it could be a much bigger problem", so we should be a bit more proactive about spreading connections.
3) If things come up slowly, it is entirely plausible that all agents end up connected to one controller even though we have HA, because the load was never high enough to push any of them to the second address in the list.
We might want to change our logic for *agents* (probably not clients) to "select a random address to prioritize" instead of prioritizing the address of the last successful connection.
4) We may also want to introduce something more like exponential and/or randomized backoff rather than a fixed wait before retrying. We could also enforce this on the controller side (keep a small buffer of IP addresses to force to sleep, randomize the wait time, etc.).
5) The basic motivation was that on a really large model (~250 machines, ~2500 agents), when a given controller restarted, it struggled to come back up to a working state without triggering another restart. Firewalling the API port allowed the other controllers to handle the load until things settled down.
It might also be that we need to be more aggressive during initialization about rejecting "remote" IP addresses connecting to port 17070 while we are getting set up. (Caveat: we do like it if you can run 'juju status' while the controller is starting up.) Maybe just make logins a lot slower (sleep 1s), and/or when we see the request is from an agent rather than a user, sleep even longer before dropping the connection.
We also don't wait for Mongo to recover to a stable state; we should monitor how Mongo behaves and throttle connections until it is stable.