kuryr-kubernetes

CNI goes useless if the watcher loses connectivity with the API for longer than the retry timeout

Bug #1776676 reported by Antoni Segura Puimedon on 2018-06-13

Affects	Status	Importance	Assigned to	Milestone
kuryr-kubernetes	Fix Released	Critical	Michal Dulko	

Bug Description

When CNI probes are not enabled and the API becomes unavailable for longer than the watch retry timeout, the pod watcher exits gracefully but the CNI daemon keeps running. Especially in containerized deployments, this means that the CNI daemon pod is never restarted and never returns to a working state without manual intervention (deleting the pod).

We should make sure that, if no watchers remain, we exit the controller/CNI daemon process (sys.exit), since it cannot do anything useful at that point. That way the pod gets restarted and we eventually return to a working state once the API becomes reachable again.
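A minimal sketch of the idea, using only the Python standard library. It is not the code from the actual fix: `WatcherRegistry`, `run_daemon`, and the exit status of 1 are hypothetical names and choices made for illustration. Each watcher runs in its own thread, and when the last one returns, the main thread is signalled so it can end the process.

```python
import sys
import threading


class WatcherRegistry:
    """Hypothetical helper: tracks watchers and reports when none remain."""

    def __init__(self, on_all_exited):
        self._lock = threading.Lock()
        self._pending = {}    # name -> watch function, registered but not started
        self._alive = set()   # names of watchers that are still running
        self._on_all_exited = on_all_exited

    def add(self, name, watch_fn):
        """Register a watcher; watch_fn returns when the watcher gives up."""
        self._pending[name] = watch_fn

    def start_all(self):
        """Start every registered watcher in its own thread."""
        with self._lock:
            # Mark all watchers alive first so that an early exit cannot be
            # mistaken for the last one.
            self._alive.update(self._pending)
        threads = []
        for name, watch_fn in self._pending.items():
            thread = threading.Thread(
                target=self._run, args=(name, watch_fn), name=name, daemon=True)
            thread.start()
            threads.append(thread)
        self._pending = {}
        return threads

    def _run(self, name, watch_fn):
        try:
            watch_fn()
        finally:
            with self._lock:
                self._alive.discard(name)
                last = not self._alive
            if last:
                self._on_all_exited()


def run_daemon(watch_fns, exit_code=1):
    """Block until every watcher has exited, then end the process."""
    all_gone = threading.Event()
    registry = WatcherRegistry(on_all_exited=all_gone.set)
    for name, watch_fn in watch_fns.items():
        registry.add(name, watch_fn)
    registry.start_all()
    if watch_fns:
        all_gone.wait()
    # sys.exit() only ends the process when raised in the main thread,
    # so the watcher threads signal an event instead of exiting directly.
    sys.exit(exit_code)
```

With something like `run_daemon({"pod": watch_pods})`, where `watch_pods` stands in for whatever function runs the pod watcher, the process ends as soon as the pod watcher gives up, which lets k8s/ocp restart the pod.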

Antoni Segura Puimedon (celebdor) on 2018-06-13

Changed in kuryr-kubernetes:
status:	New → Triaged
importance:	Undecided → Critical
assignee:	nobody → Antoni Segura Puimedon (celebdor)

OpenStack Infra (hudson-openstack) wrote on 2018-06-13: Fix proposed to kuryr-kubernetes (master)

Fix proposed to branch: master
Review: https://review.openstack.org/575119

Changed in kuryr-kubernetes:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) on 2018-07-11

Changed in kuryr-kubernetes:
assignee:	Antoni Segura Puimedon (celebdor) → Michal Dulko (michal-dulko-f)

OpenStack Infra (hudson-openstack) wrote on 2018-07-17: Fix merged to kuryr-kubernetes (master)

Reviewed: https://review.openstack.org/575119
Committed: https://git.openstack.org/cgit/openstack/kuryr-kubernetes/commit/?id=372b835ebb1ebc2592983e3d4cc6dfdc8fa90633
Submitter: Zuul
Branch: master

commit 372b835ebb1ebc2592983e3d4cc6dfdc8fa90633
Author: Antoni Segura Puimedon <email address hidden>
Date: Wed Jun 13 15:45:43 2018 +0200

process to gracefully exit when last watcher exits

    In case all the watchers (in the CNI case the pod watcher only) have
    gracefully exited, continuing the process only serves to give a false
    appearance of things working. At the same time, it prevents the
    containerized deployment orchestrator from realizing that the Kuryr pod
    is not functional so it does not restart it.

    This fix allows environments without health probes, in which all
    watchers have gracefully exited, to be restarted by k8s/ocp and
    eventually work again should the issue that caused the graceful exits
    be solved.

    Change-Id: Id70978e06d980bc0ffa08bcee02d78bef9dcbeb8
    Closes-Bug: #1776676
    Signed-off-by: Antoni Segura Puimedon <email address hidden>

Changed in kuryr-kubernetes:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2018-07-18: Fix proposed to kuryr-kubernetes (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/583553

OpenStack Infra (hudson-openstack) wrote on 2018-07-22: Fix merged to kuryr-kubernetes (stable/queens)

Reviewed: https://review.openstack.org/583553
Committed: https://git.openstack.org/cgit/openstack/kuryr-kubernetes/commit/?id=f4769c3516798a6b5675400a37a8b22aef868398
Submitter: Zuul
Branch: stable/queens

commit f4769c3516798a6b5675400a37a8b22aef868398
Author: Antoni Segura Puimedon <email address hidden>
Date: Wed Jun 13 15:45:43 2018 +0200

process to gracefully exit when last watcher exits

    Change-Id: Id70978e06d980bc0ffa08bcee02d78bef9dcbeb8
    Closes-Bug: #1776676
    Depends-On: https://review.openstack.org/583922
    Signed-off-by: Antoni Segura Puimedon <email address hidden>
    (cherry picked from commit 372b835ebb1ebc2592983e3d4cc6dfdc8fa90633)