After an overnight soak, networking pods on standby controller failed to get cluster information

Bug #1847660 reported by Tee Ngo
Affects: StarlingX
Status: Invalid
Importance: High
Assigned to: Steven Webster
Milestone: (none)

Bug Description

Brief Description
-----------------
After an overnight soak with no application pods, the ceph-pools-audit pod remained stuck in the ContainerCreating state and no application pods could start on the standby controller.

Severity
--------
Critical

Steps to Reproduce
------------------
Run various system tests and some experimental tests that entailed launching pods with different registries, pod life cycles, and image pull policies.

Soak system overnight with no application pods.

Expected Behavior
------------------
The ceph-pools-audit pod is created at each audit cycle, and application pods can again be scheduled and run on all nodes, including controller-1, after the soak.

Actual Behavior
----------------
The ceph-pools-audit pod is stuck in the ContainerCreating state on controller-1, as are newly launched application pods.

>kubectl describe pod -n kube-system ceph-pools-audit-1570654500-8g9kd
...
  Warning FailedCreatePodSandBox 16h kubelet, controller-1 Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "bf09a329e420195005783041fd4be31e2a3ab3a1396e9a5f40ca3b69d5dc6267" network for pod "ceph-pools-audit-1570654500-8g9kd": NetworkPlugin cni failed to set up pod "ceph-pools-audit-1570654500-8g9kd_kube-system" network: Multus: Err adding pod to network "chain": Multus: error in invoke Conflist add - "chain": error in getting result from AddNetworkList: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout, failed to clean up sandbox container "bf09a329e420195005783041fd4be31e2a3ab3a1396e9a5f40ca3b69d5dc6267" network for pod "ceph-pools-audit-1570654500-8g9kd": NetworkPlugin cni failed to teardown pod "ceph-pools-audit-1570654500-8g9kd_kube-system" network: Multus: error in invoke Conflist Del - "chain": error in getting result from DelNetworkList: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout]
  Warning FailedCreatePodSandBox 16h kubelet, controller-1 Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "1c70c9f92daf52d49d036424d5b626b01285801e86c5b78f5a9db2ba497c892b" network for pod "ceph-pools-audit-1570654500-8g9kd": NetworkPlugin cni failed to set up pod "ceph-pools-audit-1570654500-8g9kd_kube-system" network: Multus: Err adding pod to network "chain": Multus: error in invoke Conflist add - "chain": error in getting result from AddNetworkList: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout, failed to clean up sandbox container "1c70c9f92daf52d49d036424d5b626b01285801e86c5b78f5a9db2ba497c892b" network for pod "ceph-pools-audit-1570654500-8g9kd": NetworkPlugin cni failed to teardown pod "ceph-pools-audit-1570654500-8g9kd_kube-system" network: Multus: error in invoke Conflist Del - "chain": error in getting result from DelNetworkList: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout]
  Normal SandboxChanged 16h (x15 over 17h) kubelet, controller-1 Pod sandbox changed, it will be killed and re-created.
  Warning FailedCreatePodSandBox 52s (x478 over 16h) kubelet, controller-1 (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "dbfb5eeae5a79775cdecf317e207ec1ff1b4e6dd1e877988f3390dfb33a385ea" network for pod "ceph-pools-audit-1570654500-8g9kd": NetworkPlugin cni failed to set up pod "ceph-pools-audit-1570654500-8g9kd_kube-system" network: Multus: Err adding pod to network "chain": Multus: error in invoke Conflist add - "chain": error in getting result from AddNetworkList: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout, failed to clean up sandbox container "dbfb5eeae5a79775cdecf317e207ec1ff1b4e6dd1e877988f3390dfb33a385ea" network for pod "ceph-pools-audit-1570654500-8g9kd": NetworkPlugin cni failed to teardown pod "ceph-pools-audit-1570654500-8g9kd_kube-system" network: Multus: error in invoke Conflist Del - "chain": error in getting result from DelNetworkList: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout]

The kube-proxy logs were reporting:
E1009 20:55:25.032453 1 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Endpoints: Get https://192.168.215.102:6443/api/v1/endpoints?labelSelector=%21service.kubernetes.io%2Fservice-proxy-name&limit=500&resourceVersion=0: dial tcp 192.168.215.102:6443: connect: connection refused

However, it was possible to connect manually to the IP/port. Restarting multus and kube-proxy pods on controller-1 did not help.
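For reference, a minimal sketch of the kind of manual connectivity check that was done (the /healthz path and the use of curl/nc here are illustrative; any TCP-level check toward the same endpoints would do):

# API server endpoint reported in the kube-proxy errors
curl -vk https://192.168.215.102:6443/healthz
# In-cluster service VIP reported in the CNI/Multus errors
curl -vk https://10.96.0.1:443/healthz
# Plain TCP check without TLS
nc -zv 192.168.215.102 6443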

Reproducibility
---------------
Seen once

System Configuration
--------------------
AIODX+

Branch/Pull Time/Commit
-----------------------
Sept. 29, 2019 build

Last Pass
---------
There was no issue with previous overnight soaks

Timestamp/Logs
--------------
See full collect of controller-1 attached

Test Activity
-------------
System tests

Revision history for this message
Tee Ngo (teewrs) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / high priority: the system went into a bad state and required manual intervention to recover. The trigger and frequency are unknown; this requires further investigation.

tags: added: stx.3.0 stx.containers stx.networking
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Steven Webster (swebster-wr)
Revision history for this message
Steven Webster (swebster-wr) wrote :

My investigation of this led me to believe it was a kube-proxy / iptables issue.

I was able to examine this system while the problem was occurring, and found that I could not manually access 192.168.215.102:6443 or 10.96.0.1:443.

I attempted to restart the kube-proxy pods and kubelet, but the problem did not go away.
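(The restarts were along these lines; the label selectors and the assumption that kubelet runs as a systemd service are typical for this kind of load but are not confirmed from the logs:)

# Delete the proxy/CNI pods on controller-1 so the daemonsets recreate them
kubectl -n kube-system delete pod -l k8s-app=kube-proxy --field-selector spec.nodeName=controller-1
kubectl -n kube-system delete pod -l app=multus --field-selector spec.nodeName=controller-1
# Restart kubelet on controller-1
sudo systemctl restart kubelet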

Comparing the iptables rules from controller-0 (working) with those on controller-1 did not show anything obvious.
While tracing the iptables on controller-1, the system rebooted. When it recovered, everything came up properly and there were no more connectivity issues.
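For anyone reproducing this, a rough sketch of one way to do that comparison (the grep pattern and file paths are illustrative):

# On each controller, dump the service-related NAT rules programmed by kube-proxy
sudo iptables-save -t nat | grep -E 'KUBE-(SERVICES|SVC|SEP|NODEPORTS)' | sort > /tmp/kube-nat-$(hostname)
# After copying both dumps to one node, diff them
diff /tmp/kube-nat-controller-0 /tmp/kube-nat-controller-1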

To debug this further, the issue will need to be reproduced.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The issue has not been reproducible since the initial occurrence. Unfortunately, we don't have enough data to determine the root cause.

@Tee, please re-open this if the same failure is seen again and contact Steve to investigate.

Changed in starlingx:
status: Triaged → Invalid