Brief Description
-----------------
After an overnight soak with no application pods, the ceph-pools-audit pod remained stuck in the ContainerCreating state, and no application pods could start on the standby controller.
Severity
--------
Critical
Steps to Reproduce
------------------
Run various system tests and some experimental tests that involved launching pods with different registries, pod lifecycles, and image pull policies.
Soak system overnight with no application pods.
Expected Behavior
------------------
The ceph-pools-audit pod can be created at each audit cycle, and application pods can again be scheduled and run on all nodes, including controller-1, after the soak.
Actual Behavior
----------------
The ceph-pools-audit pod is stuck in the ContainerCreating state on controller-1, as are newly launched application pods.
> kubectl describe pod -n kube-system ceph-pools-audit-1570654500-8g9kd
...
Warning FailedCreatePodSandBox 16h kubelet, controller-1 Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "bf09a329e420195005783041fd4be31e2a3ab3a1396e9a5f40ca3b69d5dc6267" network for pod "ceph-pools-audit-1570654500-8g9kd": NetworkPlugin cni failed to set up pod "ceph-pools-audit-1570654500-8g9kd_kube-system" network: Multus: Err adding pod to network "chain": Multus: error in invoke Conflist add - "chain": error in getting result from AddNetworkList: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout, failed to clean up sandbox container "bf09a329e420195005783041fd4be31e2a3ab3a1396e9a5f40ca3b69d5dc6267" network for pod "ceph-pools-audit-1570654500-8g9kd": NetworkPlugin cni failed to teardown pod "ceph-pools-audit-1570654500-8g9kd_kube-system" network: Multus: error in invoke Conflist Del - "chain": error in getting result from DelNetworkList: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout]
Warning FailedCreatePodSandBox 16h kubelet, controller-1 Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "1c70c9f92daf52d49d036424d5b626b01285801e86c5b78f5a9db2ba497c892b" network for pod "ceph-pools-audit-1570654500-8g9kd": NetworkPlugin cni failed to set up pod "ceph-pools-audit-1570654500-8g9kd_kube-system" network: Multus: Err adding pod to network "chain": Multus: error in invoke Conflist add - "chain": error in getting result from AddNetworkList: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout, failed to clean up sandbox container "1c70c9f92daf52d49d036424d5b626b01285801e86c5b78f5a9db2ba497c892b" network for pod "ceph-pools-audit-1570654500-8g9kd": NetworkPlugin cni failed to teardown pod "ceph-pools-audit-1570654500-8g9kd_kube-system" network: Multus: error in invoke Conflist Del - "chain": error in getting result from DelNetworkList: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout]
Normal SandboxChanged 16h (x15 over 17h) kubelet, controller-1 Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 52s (x478 over 16h) kubelet, controller-1 (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "dbfb5eeae5a79775cdecf317e207ec1ff1b4e6dd1e877988f3390dfb33a385ea" network for pod "ceph-pools-audit-1570654500-8g9kd": NetworkPlugin cni failed to set up pod "ceph-pools-audit-1570654500-8g9kd_kube-system" network: Multus: Err adding pod to network "chain": Multus: error in invoke Conflist add - "chain": error in getting result from AddNetworkList: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout, failed to clean up sandbox container "dbfb5eeae5a79775cdecf317e207ec1ff1b4e6dd1e877988f3390dfb33a385ea" network for pod "ceph-pools-audit-1570654500-8g9kd": NetworkPlugin cni failed to teardown pod "ceph-pools-audit-1570654500-8g9kd_kube-system" network: Multus: error in invoke Conflist Del - "chain": error in getting result from DelNetworkList: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout]
The kube-proxy logs were reporting:
E1009 20:55:25.032453 1 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Get https://192.168.215.102:6443/api/v1/endpoints?labelSelector=%21service.kubernetes.io%2Fservice-proxy-name&limit=500&resourceVersion=0: dial tcp 192.168.215.102:6443: connect: connection refused
However, it was possible to connect to the IP/port manually. Restarting the multus and kube-proxy pods on controller-1 did not help.
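For reference, the manual connectivity checks and pod restarts were along these lines. This is a sketch only: the IPs and ports come from the logs above, while the pod label selectors (app=multus, k8s-app=kube-proxy) are assumptions based on typical DaemonSet manifests and may differ on this build.

```shell
# Check the apiserver address that kube-proxy reported as refused
# (from the kube-proxy log above).
curl -k https://192.168.215.102:6443/healthz

# Check the in-cluster service IP that Multus/Calico times out on
# (10.96.0.1:443 is the default "kubernetes" ClusterIP service).
timeout 5 curl -k https://10.96.0.1:443/healthz

# Confirm the "kubernetes" service still has valid endpoints.
kubectl get endpoints kubernetes -n default

# Restart the CNI-related pods on controller-1; label selectors are
# assumed and may need adjusting. This did not resolve the issue.
kubectl delete pod -n kube-system -l app=multus \
    --field-selector spec.nodeName=controller-1
kubectl delete pod -n kube-system -l k8s-app=kube-proxy \
    --field-selector spec.nodeName=controller-1
```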
Reproducibility
---------------
Seen once
System Configuration
--------------------
AIODX+
Branch/Pull Time/Commit
-----------------------
Sept. 29, 2019 build
Last Pass
---------
There was no issue with previous overnight soaks.
Timestamp/Logs
--------------
See the full collect of controller-1 attached.
Test Activity
-------------
System tests
Marking as stx.3.0 / high priority: the system went into a bad state and required manual intervention to recover. The trigger and frequency are unknown; this requires further investigation.