charms do not inform user when calico-node is in a restart loop

Bug #1895854 reported by Chris Sanders
Affects        Status       Importance  Assigned to  Milestone
Calico Charm   Incomplete   Undecided   Unassigned
Canal Charm    Incomplete   Undecided   Unassigned

Bug Description

During troubleshooting of an issue I found that Canal (deployed with the Calico charm) uses a systemd unit that is configured to always restart (Restart=always) with no rate limit.

This results in a very rapid restart cycle of the service that never ends. In this case there was an error in the service configuration, and checking the service status would typically show it as running when in fact it was starting, failing, and restarting in a loop.
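One way to see through the misleading "running" status, assuming systemd 235 or newer, is to check the unit's automatic-restart counter or to follow the journal while the loop is happening:

    # NRestarts counts automatic restarts performed by systemd
    systemctl show calico-node -p NRestarts
    # watch the start/fail/restart cycle live
    journalctl -u calico-node -f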

Ideally, the systemd unit should be configured with a start limit interval and burst so that continued failures do not result in a persistent restart loop.
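For example, a drop-in override along these lines would let the service retry a few times and then stop instead of looping forever (the file path and values are illustrative, not what the charm ships):

    # /etc/systemd/system/calico-node.service.d/restart-limit.conf
    [Unit]
    StartLimitIntervalSec=300
    StartLimitBurst=5

    [Service]
    Restart=on-failure
    RestartSec=10

Note that on systemd 230 and newer the StartLimit* settings belong in the [Unit] section.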

The configuration error in this case was having rp_filter=2 set without enabling the charm option that allows this setting. Recreating that configuration is therefore a good way to reproduce the continual restarting of the service.
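A rough reproducer sketch (assuming the application is deployed as "calico" and using the option name mentioned in the discussion below):

    # set loose reverse-path filtering on a worker
    sudo sysctl -w net.ipv4.conf.all.rp_filter=2
    # confirm the charm option is still at its default (false)
    juju config calico ignore-loose-rpf

With that combination, calico-node fails on start and systemd restarts it indefinitely.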

Revision history for this message
Chris Sanders (chris.sanders) wrote :

For this site, the following versions are in use:
Kubernetes 1.18.8
Kubernetes-worker charm: 696
canal: 0.10.0/3.10.1
canal charm: 733

I believe these are all of the relevant charm versions. Since Canal is used with Calico, I'm not sure whether both charms are affected, so I have subscribed both for triage.

description: updated
Revision history for this message
George Kraft (cynerva) wrote :

Thanks for the report. I discussed this with the team, and while we do recognize the problem, we believe that the proposed solution is likely to cause more harm than good. We have seen time and time again that allowing services to "give up" seriously undermines the robustness of the cluster, and in particular its resiliency against temporary outages and race conditions in the charms.

That said, we don't want to leave this unaddressed. I think, at a minimum, the charm should enter a Blocked state to warn the user if the rp_filter=2 sysctl is set but the ignore-loose-rpf charm config is not enabled.
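A minimal sketch of that check, assuming the classic charmhelpers API these charms use (the function name is illustrative):

    from charmhelpers.core.hookenv import config, status_set

    def check_loose_rpf():
        # rp_filter=2 means "loose" reverse-path filtering
        with open('/proc/sys/net/ipv4/conf/all/rp_filter') as f:
            rp_filter = int(f.read().strip())
        if rp_filter == 2 and not config('ignore-loose-rpf'):
            status_set('blocked',
                       'rp_filter is set to 2 but ignore-loose-rpf '
                       'is not enabled')
            return False
        return True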

I think it would also be valuable to have the charms monitor the calico-node service and trigger an alert if the service has restarted more than X times in Y minutes. We would likely want to put the charm in Blocked status if this occurs. An alert like this might need to be tunable via charm config.
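As a sketch of what that monitoring could look like, here is one approach using the NRestarts property systemd exposes (systemd >= 235). Note that this counter is cumulative since the unit was loaded, so a real "X times in Y minutes" check would need to persist samples across hook runs; the threshold below is an illustrative stand-in for the proposed charm config:

    import subprocess

    from charmhelpers.core.hookenv import status_set

    def restart_count(service):
        # NRestarts is the number of automatic restarts systemd has performed
        out = subprocess.check_output(
            ['systemctl', 'show', service, '-p', 'NRestarts'])
        return int(out.decode().strip().split('=', 1)[1])

    def check_restart_loop(max_restarts=5):
        if restart_count('calico-node') > max_restarts:
            status_set('blocked',
                       'calico-node is restarting repeatedly; '
                       'check journalctl -u calico-node')
            return False
        return True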

Would the above solutions work for your needs? We are open to discussing other solutions as well, if you have any ideas.

summary: - Add a restart limit to canal
+ charms do not inform user when calico-node is in a restart loop
Changed in charm-calico:
status: New → Incomplete
Changed in charm-canal:
status: New → Incomplete