Activity log for bug #1696113

Date Who What changed Old value New value Message
2017-06-06 12:28:09 John A Meinel bug added bug
2017-06-06 12:31:29 John A Meinel description Old value: the description as originally filed (points 1-4 below). New value: the same text with point 5 added (a sketch of the retry ideas in points 3 and 4 follows this entry):
When restarting a controller with many agents connected to it (N agents > 2000), it often struggles to come back up in a stable state. We currently limit the number of concurrent login attempts, but we don't do much to encourage load balancing to other controllers (when in HA). There are a few points we could try to address:
1) We currently bounce agents only after we have successfully negotiated a TLS session and received their Login information. That means we have already done a fair bit of CPU work to negotiate the session key. It would be better to reject the connection before TLS negotiation.
2) Other than existing load slowing the machine down, we don't bias connections away from the machine that is the mongo master, and there may even be some advantage to being on that machine (slightly faster access to the DB). However, at scale Mongo may ultimately want to consume most of that machine's CPU and RAM, so we may need to start biasing logins away long before the load is high enough to push them off naturally. One possibility is to introduce a sleep between socket.accept and handling the connection; we could make that delay slightly larger on the mongo master machine, vary it with the number of current connections, or both. This is more about "the load right now isn't a problem, but in 10 minutes it could be a much bigger problem", so we should be more proactive about spreading connections.
3) If things come up slowly, it is entirely plausible that all agents end up connected to one controller even though we have HA, because load was never high enough to push them to the second node in the list. We might want to change the logic for *agents* (probably not clients) to prioritize a randomly selected address instead of the address they last connected to.
4) We may also want to introduce exponential/randomized backoff rather than a fixed wait before retrying. We could also put this on the controller side (keep a small buffer of IP addresses that are forced to sleep, randomize the wait time, etc).
5) The basic motivation was that on a really large model (~250 machines, ~2500 agents), when a given controller restarted it struggled to come back to a working state without triggering another restart. Firewalling the API port allowed the other controllers to handle the load until things settled down. It might also be that we need to be more vigorous during initialization about rejecting "remote" IP addresses connecting to 17070 while we are getting set up. (Caveat: we do like it if you can run 'juju status' while we're starting up.) Maybe just make logins a lot slower (sleep 1s), and/or sleep even longer before dropping the connection when we see the request comes from an agent rather than a user.
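As a rough illustration of the ideas in points 3 and 4 above, the following Go sketch (not Juju's actual reconnection code; dialController, the address list, and the delay bounds are hypothetical stand-ins) shows an agent shuffling its list of controller addresses instead of always preferring the last-used one, and retrying with a jittered, exponentially growing delay instead of a fixed wait.

// Minimal sketch of ideas 3 and 4 from the description above; all names
// here are illustrative, not Juju API.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// dialController is a stand-in for the agent's API dial. It always fails
// here so the retry behaviour is visible when the sketch is run.
func dialController(addr string) error {
	return errors.New("connection refused by " + addr)
}

// connectWithBackoff shuffles the candidate addresses (idea 3) and sleeps
// for a randomised, exponentially growing interval between rounds (idea 4).
func connectWithBackoff(addrs []string, maxAttempts int) error {
	delay := 1 * time.Second
	const maxDelay = 2 * time.Minute

	for attempt := 0; attempt < maxAttempts; attempt++ {
		// Try controllers in random order rather than prioritising the
		// previously used address, spreading agents across an HA cluster.
		shuffled := append([]string(nil), addrs...)
		rand.Shuffle(len(shuffled), func(i, j int) {
			shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
		})

		for _, addr := range shuffled {
			if err := dialController(addr); err == nil {
				return nil
			}
		}

		// Full jitter: sleep a random duration in [0, delay), then double
		// the ceiling up to maxDelay, so a restarted controller is not hit
		// by thousands of agents retrying in lock-step.
		jitter := time.Duration(rand.Int63n(int64(delay)))
		fmt.Printf("attempt %d failed, retrying in %v\n", attempt+1, jitter)
		time.Sleep(jitter)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return errors.New("could not reach any controller")
}

func main() {
	addrs := []string{"10.0.0.1:17070", "10.0.0.2:17070", "10.0.0.3:17070"}
	if err := connectWithBackoff(addrs, 3); err != nil {
		fmt.Println(err)
	}
}

Randomising both the address order and the wait spreads a restart-triggered thundering herd of ~2500 agents across the HA controllers and across time, instead of having them all hammer the same controller in lock-step.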
2017-06-20 09:16:21 Alvaro Uria tags performance scaling 4010 performance scaling
2018-10-23 16:13:06 Junien F bug added subscriber The Canonical Sysadmins
2018-10-23 16:13:08 Junien F bug added subscriber Junien Fridrick
2018-10-23 21:29:29 Haw Loeung bug added subscriber Haw Loeung
2022-11-03 15:19:52 Canonical Juju QA Bot juju: status Triaged Expired
2022-11-03 15:19:54 Canonical Juju QA Bot tags 4010 performance scaling 4010 expirebugs-bot performance scaling