Comment 12 for bug 1934524

John A Meinel (jameinel) wrote:

Our current thinking is that requests for more leases can outstrip the backend's ability to process them, and there isn't good back pressure to make the clients slow down. That leads to a cascading failure: the queue of items takes longer to work through than the clients' retry interval, so the backend never catches up. (For example, client A requests leadership and the backend replies in 6s, but client A timed out after 5s, has already issued its next request, and is no longer listening for the previous response.)
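
As a rough illustration of that failure mode, here is a minimal sketch in plain Go (not the actual Juju lease-manager code; all names and durations are illustrative, scaled down from the 6s-backend / 5s-client example above). The client's timeout is shorter than the backend's per-request processing time, and each timeout triggers an immediate retry, so the backend keeps doing work whose answers nobody is waiting for:

package main

import (
	"fmt"
	"time"
)

func main() {
	requests := make(chan chan struct{}, 100)

	// Backend: drains claims one at a time; each takes longer than the
	// client is willing to wait (600ms vs. the client's 500ms timeout).
	go func() {
		for reply := range requests {
			time.Sleep(600 * time.Millisecond)
			close(reply) // the answer arrives after the client stopped listening
		}
	}()

	// Client: times out before the backend can answer, then retries,
	// adding another item to a queue that never catches up.
	for attempt := 1; attempt <= 4; attempt++ {
		reply := make(chan struct{})
		requests <- reply
		select {
		case <-reply:
			fmt.Printf("attempt %d: claim succeeded\n", attempt)
			return
		case <-time.After(500 * time.Millisecond):
			fmt.Printf("attempt %d: timed out, retrying\n", attempt)
			// No back pressure: the timed-out claim stays queued and the
			// backend will still spend time answering it.
		}
	}
	fmt.Println("never saw a successful claim, even though the backend works through every request")
}

In this sketch the client never observes a success even though the backend completes every request it is given, which matches the "never catching up" behaviour described above.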

Interestingly, I do see the failure/restart moment:
https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&var-controller=prodstack5-prodstack5-prodstack-is&var-host=All&var-node=All&from=1627400967967&to=1627438668890

I also see one time where things looked like they were falling apart, but there wasn't a need to actually restart:
https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&var-controller=prodstack5-prodstack5-prodstack-is&var-host=All&var-node=All&from=1627549993845&to=1627645534714

(leadership times are spiking, goroutine count is going up, etc.)

There is another instance of it at:
https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&var-controller=prodstack5-prodstack5-prodstack-is&var-host=All&var-node=All&from=1627608107638&to=1627615798850

This again shows that it was able to recover.

These do end up correlated with spikes in "update leaseholders failed" Mongo operations. (Though "update leaseholders" is also succeeding at the same time, so it is mostly indicative of a lot of lease churn.)
https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?viewPanel=89&orgId=1&var-controller=prodstack5-prodstack5-prodstack-is&var-host=All&var-node=All&from=1627387701024&to=1627457136614