Activity log for bug #1696113

Date Who What changed Old value New value Message
2017-06-06 12:28:09 John A Meinel bug added bug
2017-06-06 12:31:29 John A Meinel description Old value: the description as originally filed (points 1-4 below). New value: the same text with point 5 added (a sketch of the retry ideas in points 3 and 4 follows this entry):
When restarting a controller with many agents connected to it (N agents > 2000), it often struggles to come back up in a stable state. We currently limit the number of concurrent login attempts, but we don't do much to encourage load balancing to other controllers (when in HA). There are a few points we could try to address:
1) We currently bounce agents only after we have successfully negotiated a TLS session and received their Login information. That means we have already done a fair bit of CPU work to negotiate the session key. It would be better to reject the connection before TLS negotiation.
2) Other than existing load slowing the machine down, we don't bias connections away from the machine that is the mongo master, and there may even be some advantage to being on that machine (slightly faster access to the DB). However, at scale Mongo may ultimately want to consume most of that machine's CPU and RAM, so we may need to start biasing logins away long before the load is high enough to push them off naturally. One possibility is to introduce a sleep between socket.accept and handling the connection; we could make that delay slightly larger on the mongo master machine, vary it with the number of current connections, or both. This is more about "the load right now isn't a problem, but in 10 minutes it could be a much bigger problem", so we should be more proactive about spreading connections.
3) If things come up slowly, it is entirely plausible that all agents end up connected to one controller even though we have HA, because load was never high enough to push them to the second node in the list. We might want to change the logic for *agents* (probably not clients) to prioritize a randomly selected address instead of the address they last connected to.
4) We may also want to introduce exponential/randomized backoff rather than a fixed wait before retrying. We could also put this on the controller side (keep a small buffer of IP addresses that are forced to sleep, randomize the wait time, etc).
5) The basic motivation was that on a really large model (~250 machines, ~2500 agents), when a given controller restarted it struggled to come back to a working state without triggering another restart. Firewalling the API port allowed the other controllers to handle the load until things settled down. It might also be that we need to be more vigorous during initialization about rejecting "remote" IP addresses connecting to 17070 while we are getting set up. (Caveat: we do like it if you can run 'juju status' while we're starting up.) Maybe just make logins a lot slower (sleep 1s), and/or sleep even longer before dropping the connection when we see the request comes from an agent rather than a user.
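As a rough illustration of the ideas in points 3 and 4 above, the following Go sketch (not Juju's actual reconnection code; dialController, the address list, and the delay bounds are hypothetical stand-ins) shows an agent shuffling its list of controller addresses instead of always preferring the last-used one, and retrying with a jittered, exponentially growing delay instead of a fixed wait.

// Minimal sketch of ideas 3 and 4 from the description above; all names
// here are illustrative, not Juju API.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// dialController is a stand-in for the agent's API dial. It always fails
// here so the retry behaviour is visible when the sketch is run.
func dialController(addr string) error {
	return errors.New("connection refused by " + addr)
}

// connectWithBackoff shuffles the candidate addresses (idea 3) and sleeps
// for a randomised, exponentially growing interval between rounds (idea 4).
func connectWithBackoff(addrs []string, maxAttempts int) error {
	delay := 1 * time.Second
	const maxDelay = 2 * time.Minute

	for attempt := 0; attempt < maxAttempts; attempt++ {
		// Try controllers in random order rather than prioritising the
		// previously used address, spreading agents across an HA cluster.
		shuffled := append([]string(nil), addrs...)
		rand.Shuffle(len(shuffled), func(i, j int) {
			shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
		})

		for _, addr := range shuffled {
			if err := dialController(addr); err == nil {
				return nil
			}
		}

		// Full jitter: sleep a random duration in [0, delay), then double
		// the ceiling up to maxDelay, so a restarted controller is not hit
		// by thousands of agents retrying in lock-step.
		jitter := time.Duration(rand.Int63n(int64(delay)))
		fmt.Printf("attempt %d failed, retrying in %v\n", attempt+1, jitter)
		time.Sleep(jitter)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return errors.New("could not reach any controller")
}

func main() {
	addrs := []string{"10.0.0.1:17070", "10.0.0.2:17070", "10.0.0.3:17070"}
	if err := connectWithBackoff(addrs, 3); err != nil {
		fmt.Println(err)
	}
}

Randomising both the address order and the wait spreads a restart-triggered thundering herd of ~2500 agents across the HA controllers and across time, instead of having them all hammer the same controller in lock-step.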
2017-06-20 09:16:21 Alvaro Uria tags performance scaling 4010 performance scaling
2018-10-23 16:13:06 Junien F bug added subscriber The Canonical Sysadmins
2018-10-23 16:13:08 Junien F bug added subscriber Junien Fridrick
2018-10-23 21:29:29 Haw Loeung bug added subscriber Haw Loeung
2022-11-03 15:19:52 Canonical Juju QA Bot juju: status Triaged Expired
2022-11-03 15:19:54 Canonical Juju QA Bot tags 4010 performance scaling 4010 expirebugs-bot performance scaling