Calico mechanism driver doesn't spot if Felix is cyclically restarting
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
networking-calico | New | Undecided | Unassigned |
Bug Description
Our FV test “test_status_
The test repeatedly kills Felix and expects the endpoint to go into state ERROR. I think the (intermittent) issue is caused by making Felix more robust by adjusting the init script to always restart it; sometimes, the test fails to stop Felix from checking in with etcd. I think the behaviour is something we can live with, but it's not perfect; the proper fix would be to give networking-calico a better “health estimator” so that it spots that Felix is cyclically restarting. I don’t think there’s anything wrong with Felix’s behaviour; it's doing its best to check in and put the most up-to-date state into etcd.
The other fix that I can think of is to delay Felix’s initial status report so it never reports if it’s in a crash loop, but I think that has a risk of breaking other restart cases (we might get spurious endpoint error reports after an expected restart).
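To make the “health estimator” idea more concrete, here is a minimal sketch of one possible approach. It is not existing networking-calico code: the class name, method name and thresholds are all hypothetical, and it assumes each Felix status report carries an uptime value that the driver can read. The estimator infers Felix's start time from each report, counts a restart whenever that start time jumps forward, and treats Felix as unhealthy once too many restarts land inside a sliding window.

```python
import collections


class RestartLoopEstimator(object):
    """Hypothetical health estimator that flags an agent stuck in a restart loop.

    Assumes every status report includes the agent's uptime in seconds.
    """

    def __init__(self, window_secs=60.0, max_restarts=3, tolerance_secs=2.0):
        self.window_secs = window_secs        # sliding window for counting restarts
        self.max_restarts = max_restarts      # this many restarts in the window => unhealthy
        self.tolerance_secs = tolerance_secs  # allowance for clock and reporting jitter
        self._last_start = None
        self._restart_times = collections.deque()

    def on_status_report(self, now, uptime_secs):
        """Record one check-in; return True if the agent looks healthy."""
        # Infer when the agent started. For a stable agent this value
        # stays (roughly) constant from one report to the next.
        start = now - uptime_secs
        if (self._last_start is not None and
                start > self._last_start + self.tolerance_secs):
            # The inferred start time moved forward, so the agent has
            # restarted since its previous report.
            self._restart_times.append(now)
        self._last_start = start

        # Forget restarts that have aged out of the sliding window.
        while (self._restart_times and
               now - self._restart_times[0] > self.window_secs):
            self._restart_times.popleft()

        return len(self._restart_times) < self.max_restarts
```

With something like this, a check-in from Felix would only count as "alive" when `on_status_report()` returns True, so an expected one-off restart would not trigger an endpoint error, while a crash loop would.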