Comment 2 for bug 1895854

George Kraft (cynerva) wrote : Re: Add a restart limit to canal

Thanks for the report. I discussed this with the team, and while we do recognize the problem, we believe that the proposed solution is likely to cause more harm than good. We have seen time and time again that allowing services to "give up" seriously undermines the robustness of the cluster, and in particular its resiliency against temporary outages and race conditions in the charms.

That said, we don't want to leave this unaddressed. I think, at a minimum, the charm should enter a Blocked state to warn the user if the rp_filter sysctl is set to 2 but the ignore-loose-rpf charm config option is not set.
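
As a rough sketch of what that check could look like (not the charm's actual code): the helper names read_rp_filter and rpf_status are made up for illustration, and reading the "all" interface value from /proc is an assumption about which sysctl entry matters here.

```python
from pathlib import Path

# Assumption: the "all" entry is the relevant one; per-interface values
# under /proc/sys/net/ipv4/conf/<iface>/ are ignored in this sketch.
RP_FILTER_PATH = Path("/proc/sys/net/ipv4/conf/all/rp_filter")


def read_rp_filter(path=RP_FILTER_PATH):
    """Return the current rp_filter value, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def rpf_status(rp_filter, ignore_loose_rpf):
    """Hypothetical helper: map the two settings to a (status, message) pair."""
    if rp_filter == 2 and not ignore_loose_rpf:
        return (
            "blocked",
            "rp_filter is set to 2 but ignore-loose-rpf is not set",
        )
    return ("active", "")
```

The charm would call something like rpf_status(read_rp_filter(), config_value) and set its workload status from the result. An unreadable sysctl is treated as "no problem detected" here, which is a design choice that could reasonably go the other way.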

I think it would also be valuable to have the charms monitor the calico-node service and trigger an alert if the service has restarted more than X times in Y minutes. We would likely want to put the charm in Blocked status if this occurs. An alert like this might need to be tunable via charm config.
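
To make the "more than X times in Y minutes" idea concrete, here is a minimal sliding-window sketch. The RestartMonitor class and its parameter names are hypothetical, and how the charm would actually detect a calico-node restart is left out. The two constructor arguments are the values that would presumably be exposed via charm config.

```python
from collections import deque


class RestartMonitor:
    """Hypothetical sliding-window counter for service restarts."""

    def __init__(self, max_restarts, window_seconds):
        self.max_restarts = max_restarts      # the "X" in the proposal
        self.window_seconds = window_seconds  # the "Y", in seconds
        self._restarts = deque()              # timestamps, oldest first

    def record_restart(self, now):
        """Record that a restart was observed at time `now` (seconds)."""
        self._restarts.append(now)

    def is_flapping(self, now):
        """Return True if more than max_restarts happened in the window."""
        cutoff = now - self.window_seconds
        # Drop restarts that have aged out of the window.
        while self._restarts and self._restarts[0] <= cutoff:
            self._restarts.popleft()
        return len(self._restarts) > self.max_restarts
```

With RestartMonitor(3, 600), a fourth restart within ten minutes makes is_flapping return True, and it returns False again once enough of those restarts age out of the window. The charm could go to Blocked while it is True without ever stopping the service, so the service still keeps retrying.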

Would the above solutions work for your needs? We are open to discussing other solutions as well, if you have any ideas.