Calico mechanism driver doesn't spot if Felix is cyclically restarting

Bug #1649808 reported by Shaun Crampton
This bug affects 1 person
Affects            Status  Importance  Assigned to  Milestone
networking-calico  New     Undecided   Unassigned

Bug Description

Our FV test “test_status_reporting” has been failing.

The test repeatedly kills Felix and expects the endpoint to go into state ERROR. I think the (intermittent) failure is a side effect of making Felix more robust by adjusting the init script to always restart it: sometimes the restarted Felix manages to check in with etcd before the test kills it again, so the test fails to stop Felix from reporting. I think we can live with the behaviour, but it's not perfect; the proper fix would be to give networking-calico a better “health estimator” so that it spots that Felix is cyclically restarting. I don’t think there’s anything wrong with Felix’s behaviour; it's doing its best to check in and put the most up-to-date state into etcd.
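To make the “health estimator” idea concrete, here is a rough sketch of one way it could work. It is not existing networking-calico code: the class name, method names and thresholds are all made up for illustration, and it assumes each Felix status report carries an uptime value that the driver can read. The idea is to derive Felix's start time from each report and count how often that start time jumps within a sliding window.

```python
import collections
import time


class RestartLoopDetector(object):
    """Hypothetical helper: flags a Felix that keeps restarting.

    Assumes each status report includes Felix's uptime in seconds.
    """

    def __init__(self, max_restarts=3, window_secs=60.0, tolerance_secs=2.0):
        self.max_restarts = max_restarts      # restarts tolerated per window
        self.window_secs = window_secs        # sliding window length
        self.tolerance_secs = tolerance_secs  # slack for clock/report jitter
        self._last_start = {}                 # hostname -> inferred start time
        self._restarts = collections.defaultdict(collections.deque)

    def record_status(self, hostname, uptime_secs, now=None):
        """Record one status report; note a restart if the start time moved."""
        now = time.time() if now is None else now
        start = now - uptime_secs
        last = self._last_start.get(hostname)
        if last is not None and start - last > self.tolerance_secs:
            self._restarts[hostname].append(now)
        self._last_start[hostname] = start

    def is_crash_looping(self, hostname, now=None):
        """True if the host restarted too often within the window."""
        now = time.time() if now is None else now
        restarts = self._restarts[hostname]
        # Drop restarts that have aged out of the sliding window.
        while restarts and now - restarts[0] > self.window_secs:
            restarts.popleft()
        return len(restarts) >= self.max_restarts
```

With something like this, the driver could put the endpoint into ERROR when is_crash_looping() returns True, even though Felix is technically still checking in. The thresholds would need tuning so that a single expected restart is not treated as a crash loop.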

The other fix that I can think of is to delay Felix’s initial status report so it never reports if it’s in a crash loop, but I think that has a risk of breaking other restart cases (we might get spurious endpoint error reports after an expected restart).
