Calico mechanism driver doesn't spot if Felix is cyclically restarting

Bug #1649808 reported by Shaun Crampton on 2016-12-14

Bug Description

Our FV test “test_status_reporting” has been failing.

The test repeatedly kills Felix and expects the endpoint to go into state ERROR. I think the (intermittent) failure stems from a change that made Felix more robust: the init script now always restarts it, so the test sometimes fails to stop Felix from checking in with etcd. I think we can live with this behaviour, but it is not perfect; the proper fix would be to give networking-calico a better “health estimator” so that it spots when Felix is cyclically restarting. I don’t think there’s anything wrong with Felix’s behaviour; it's doing its best to check in and put the most up-to-date state into etcd.
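As a rough illustration of what such a “health estimator” might look like, here is a minimal sketch. It is not existing networking-calico code: the class name, its parameters and the idea that each status report carries an uptime value are all assumptions made for the example. It counts restarts in a sliding window and treats Felix as unhealthy once there are too many, even though Felix keeps checking in.

```python
from collections import deque


class RestartLoopDetector:
    """Hypothetical helper: flags an agent that restarts too often.

    Assumes each status report carries the agent's uptime in seconds;
    this is an assumption of the sketch, not a documented interface.
    """

    def __init__(self, max_restarts=3, window_secs=60.0, slack_secs=1.0):
        self.max_restarts = max_restarts
        self.window_secs = window_secs
        self.slack_secs = slack_secs
        self._last_report = None  # (time, uptime) of the previous report
        self._restart_times = deque()

    def record_report(self, now, uptime):
        """Record one status report; return True if the agent looks healthy."""
        if self._last_report is not None:
            last_time, last_uptime = self._last_report
            # Without a restart, uptime should have grown by the elapsed time.
            expected = last_uptime + (now - last_time)
            if uptime < expected - self.slack_secs:
                self._restart_times.append(now)
        self._last_report = (now, uptime)
        # Forget restarts that have fallen out of the sliding window.
        while (self._restart_times
               and now - self._restart_times[0] > self.window_secs):
            self._restart_times.popleft()
        return len(self._restart_times) < self.max_restarts
```

With the default settings, three restarts inside 60 seconds would mark Felix as unhealthy, and it would count as healthy again once those restarts age out of the window. A single expected restart stays below the threshold, so it should not produce a spurious endpoint error.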

The other fix that I can think of is to delay Felix’s initial status report so that it never reports while it’s in a crash loop, but I think that risks breaking other restart cases (we might get spurious endpoint error reports after an expected restart).
