charms do not inform user when calico-node is in a restart loop

Bug #1895854 reported by Chris Sanders
Affects        Status       Importance  Assigned to  Milestone
Calico Charm   Incomplete   Undecided   Unassigned
Canal Charm    Incomplete   Undecided   Unassigned

Bug Description

During troubleshooting of an issue I found that Canal (deployed with the Calico charm) uses a systemd unit that is configured to always restart (Restart=always) with no rate limit.

This results in a very rapid restart cycle of the service that never ends. In this case there was an error in the service configuration, and checking the service status would typically show it as running when in fact it was starting, failing, and restarting in a loop.
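One way to see through the misleading "running" status, assuming systemd 235 or newer, is to check the unit's automatic-restart counter or to follow the journal while the loop is happening:

    # NRestarts counts automatic restarts performed by systemd
    systemctl show calico-node -p NRestarts
    # watch the start/fail/restart cycle live
    journalctl -u calico-node -f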

Ideally, the systemd unit should be configured with a start limit interval and burst so that continued failures do not result in a persistent restart loop.
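For example, a drop-in override along these lines would let the service retry a few times and then stop instead of looping forever (the file path and values are illustrative, not what the charm ships):

    # /etc/systemd/system/calico-node.service.d/restart-limit.conf
    [Unit]
    StartLimitIntervalSec=300
    StartLimitBurst=5

    [Service]
    Restart=on-failure
    RestartSec=10

Note that on systemd 230 and newer the StartLimit* settings belong in the [Unit] section.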

The configuration error in this case was having rp_filter=2 set without enabling the charm option that allows this setting. Recreating that configuration is therefore a good way to reproduce the continual restarting of the service.
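A rough reproducer sketch (assuming the application is deployed as "calico" and using the option name mentioned in the discussion below):

    # set loose reverse-path filtering on a worker
    sudo sysctl -w net.ipv4.conf.all.rp_filter=2
    # confirm the charm option is still at its default (false)
    juju config calico ignore-loose-rpf

With that combination, calico-node fails on start and systemd restarts it indefinitely.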

Revision history for this message
Chris Sanders (chris.sanders) wrote :

For this site, the following versions are in use:
Kubernetes 1.18.8
Kubernetes-worker charm: 696
canal: 0.10.0/3.10.1
canal charm: 733

I believe these are all of the relevant charm versions. Since Canal is used with Calico, I'm not sure whether both charms are affected, so I have subscribed both for triage.

description: updated
Revision history for this message
George Kraft (cynerva) wrote :

Thanks for the report. I discussed this with the team, and while we do recognize the problem, we believe that the proposed solution is likely to cause more harm than good. We have seen time and time again that allowing services to "give up" seriously undermines the robustness of the cluster, and in particular its resiliency against temporary outages and race conditions in the charms.

That said, we don't want to leave this unaddressed. I think, at a minimum, the charm should enter a Blocked state to warn the user if the rp_filter=2 sysctl is set but the ignore-loose-rpf charm config is not enabled.
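A minimal sketch of that check, assuming the classic charmhelpers API these charms use (the function name is illustrative):

    from charmhelpers.core.hookenv import config, status_set

    def check_loose_rpf():
        # rp_filter=2 means "loose" reverse-path filtering
        with open('/proc/sys/net/ipv4/conf/all/rp_filter') as f:
            rp_filter = int(f.read().strip())
        if rp_filter == 2 and not config('ignore-loose-rpf'):
            status_set('blocked',
                       'rp_filter is set to 2 but ignore-loose-rpf '
                       'is not enabled')
            return False
        return True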

I think it would also be valuable to have the charms monitor the calico-node service and trigger an alert if the service has restarted more than X times in Y minutes. We would likely want to put the charm in Blocked status if this occurs. An alert like this might need to be tunable via charm config.
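As a sketch of what that monitoring could look like, here is one approach using the NRestarts property systemd exposes (systemd >= 235). Note that this counter is cumulative since the unit was loaded, so a real "X times in Y minutes" check would need to persist samples across hook runs; the threshold below is an illustrative stand-in for the proposed charm config:

    import subprocess

    from charmhelpers.core.hookenv import status_set

    def restart_count(service):
        # NRestarts is the number of automatic restarts systemd has performed
        out = subprocess.check_output(
            ['systemctl', 'show', service, '-p', 'NRestarts'])
        return int(out.decode().strip().split('=', 1)[1])

    def check_restart_loop(max_restarts=5):
        if restart_count('calico-node') > max_restarts:
            status_set('blocked',
                       'calico-node is restarting repeatedly; '
                       'check journalctl -u calico-node')
            return False
        return True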

Would the above solutions work for your needs? We are open to discussing other solutions as well, if you have any ideas.

summary: - Add a restart limit to canal
+ charms do not inform user when calico-node is in a restart loop
Changed in charm-calico:
status: New → Incomplete
Changed in charm-canal:
status: New → Incomplete