CNI goes useless if the watcher loses connectivity with the API for longer than the retry timeout

Bug #1776676 reported by Antoni Segura Puimedon
This bug affects 1 person
Affects: kuryr-kubernetes
Status: Fix Released
Importance: Critical
Assigned to: Michal Dulko
Milestone: (none)

Bug Description

When CNI probes are not enabled and the API becomes unavailable for longer than the watch retry timeout, the pod watcher exits gracefully but the CNI daemon keeps running. In containerized deployments especially, this means the CNI daemon pod is not restarted and never returns to a working state without manual intervention (deleting the pod).

We should make sure that when no watchers remain, the controller/CNI daemon process exits (sys.exit), since it cannot do anything useful anyway. That way it gets restarted and eventually returns to a working state once the API becomes reachable again.
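
A minimal sketch of the proposed behaviour, assuming a registry that tracks active watches by path. The class name `WatcherRegistry`, its methods, and the `exit_fn` parameter are hypothetical and are not taken from the kuryr-kubernetes code; they only illustrate "exit the process once the last watcher has gracefully stopped".

```python
import sys
import threading


class WatcherRegistry:
    """Hypothetical registry of active watches (illustration only)."""

    def __init__(self, exit_on_stop=True, exit_fn=sys.exit):
        # exit_fn is injectable so the behaviour can be exercised
        # without actually terminating the interpreter.
        self._exit_on_stop = exit_on_stop
        self._exit_fn = exit_fn
        self._lock = threading.Lock()
        self._watching = set()

    def add(self, path):
        """Register a watch on the given API path."""
        with self._lock:
            self._watching.add(path)

    def watch_exited(self, path):
        """Record that the watch on `path` stopped gracefully.

        Returns the number of watches still active. When none remain
        and exit_on_stop is set, exit_fn is called with a non-zero
        status so the failure is visible to whatever supervises the
        process.
        """
        with self._lock:
            self._watching.discard(path)
            remaining = len(self._watching)
        if remaining == 0 and self._exit_on_stop:
            self._exit_fn(1)
        return remaining
```

Note that `sys.exit` only raises `SystemExit` in the calling thread, so it ends the process only when that exception propagates out of the main thread. A daemon that runs its watchers in worker threads would need another way to bring the whole process down.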

Changed in kuryr-kubernetes:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Antoni Segura Puimedon (celebdor)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kuryr-kubernetes (master)

Fix proposed to branch: master
Review: https://review.openstack.org/575119

Changed in kuryr-kubernetes:
status: Triaged → In Progress
Changed in kuryr-kubernetes:
assignee: Antoni Segura Puimedon (celebdor) → Michal Dulko (michal-dulko-f)
OpenStack Infra (hudson-openstack) wrote : Fix merged to kuryr-kubernetes (master)

Reviewed: https://review.openstack.org/575119
Committed: https://git.openstack.org/cgit/openstack/kuryr-kubernetes/commit/?id=372b835ebb1ebc2592983e3d4cc6dfdc8fa90633
Submitter: Zuul
Branch: master

commit 372b835ebb1ebc2592983e3d4cc6dfdc8fa90633
Author: Antoni Segura Puimedon <email address hidden>
Date: Wed Jun 13 15:45:43 2018 +0200

    process to gracefully exit when last watcher exits

    In case all the watchers (in the CNI case the pod watcher only) have
    gracefully exited, continuing the process only serves to give a false
    appearance of things working. At the same time, it prevents the
    containerized deployment orchestrator from realizing that the Kuryr pod
    is not functional so it does not restart it.

    This fix allows non health probes environments where all watchers have
    gracefully exited to be restarted by k8s/ocp and eventually work again
    should the issue that made the graceful exits happen be solved.

    Change-Id: Id70978e06d980bc0ffa08bcee02d78bef9dcbeb8
    Closes-Bug: #1776676
    Signed-off-by: Antoni Segura Puimedon <email address hidden>

Changed in kuryr-kubernetes:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kuryr-kubernetes (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/583553

OpenStack Infra (hudson-openstack) wrote : Fix merged to kuryr-kubernetes (stable/queens)

Reviewed: https://review.openstack.org/583553
Committed: https://git.openstack.org/cgit/openstack/kuryr-kubernetes/commit/?id=f4769c3516798a6b5675400a37a8b22aef868398
Submitter: Zuul
Branch: stable/queens

commit f4769c3516798a6b5675400a37a8b22aef868398
Author: Antoni Segura Puimedon <email address hidden>
Date: Wed Jun 13 15:45:43 2018 +0200

    process to gracefully exit when last watcher exits

    In case all the watchers (in the CNI case the pod watcher only) have
    gracefully exited, continuing the process only serves to give a false
    appearance of things working. At the same time, it prevents the
    containerized deployment orchestrator from realizing that the Kuryr pod
    is not functional so it does not restart it.

    This fix allows non health probes environments where all watchers have
    gracefully exited to be restarted by k8s/ocp and eventually work again
    should the issue that made the graceful exits happen be solved.

    Change-Id: Id70978e06d980bc0ffa08bcee02d78bef9dcbeb8
    Closes-Bug: #1776676
    Depends-On: https://review.openstack.org/583922
    Signed-off-by: Antoni Segura Puimedon <email address hidden>
    (cherry picked from commit 372b835ebb1ebc2592983e3d4cc6dfdc8fa90633)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kuryr-kubernetes 0.5.0

This issue was fixed in the openstack/kuryr-kubernetes 0.5.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kuryr-kubernetes 0.4.4

This issue was fixed in the openstack/kuryr-kubernetes 0.4.4 release.
