Our current thinking is that requests for more leases can outstrip the backend's ability to process them, and there isn't good back pressure to make the clients slow down. This leads to a cascading failure: the queue of items takes longer to process than the clients' retry interval, so the backend never catches up. (For example, client A requests leadership, the backend replies in 6s, but client A timed out after 5s, has already issued the next request, and is no longer listening for the previous response.)
Interestingly, I do see the failure/restart moment: https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&var-controller=prodstack5-prodstack5-prodstack-is&var-host=All&var-node=All&from=1627400967967&to=1627438668890
I also see one time that looks like things were falling apart, but there wasn't a need to actually restart: https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&var-controller=prodstack5-prodstack5-prodstack-is&var-host=All&var-node=All&from=1627549993845&to=1627645534714
(leadership times are spiking, goroutine count is going up, etc.)
There is another instance of it at: https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&var-controller=prodstack5-prodstack5-prodstack-is&var-host=All&var-node=All&from=1627608107638&to=1627615798850
Which again shows that it was able to recover.
These do end up correlated with spikes on "update leaseholders failed" in Mongo operations. (Though the same 'update leaseholders' operations are also succeeding at the same time, so it is mostly indicative of a lot of lease churn.) https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?viewPanel=89&orgId=1&var-controller=prodstack5-prodstack5-prodstack-is&var-host=All&var-node=All&from=1627387701024&to=1627457136614