5 mins after swact calico-kube-controllers pod not getting ready

Bug #1847607 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Joseph Richard

Bug Description

Brief Description
-----------------
After a host swact, check pod status. After 5 minutes, the calico-kube-controllers pod is still not ready; 5 minutes seems too long.

Severity
--------
Major

Steps to Reproduce
------------------
As described in the Brief Description above.

TC-name: mtc/test_swact.py::test_swact_controller_platform

Expected Behavior
------------------
The pod becomes ready within a reasonable time after the swact.

Actual Behavior
----------------
The calico-kube-controllers pod is still not ready 5 minutes after the swact (see logs below).

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Multi-node system
IPv6

Lab-name: WCP_71-75

Branch/Pull Time/Commit
-----------------------
master as of
Load: 2019-10-09_20-00-00

Last Pass
---------
Unknown

Timestamp/Logs
--------------
[2019-10-10 07:55:15,693] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

[2019-10-10 07:56:40,684] 311 DEBUG MainThread ssh.send :: Send 'kubectl get nodes'
[2019-10-10 07:56:40,954] 433 DEBUG MainThread ssh.expect :: Output:
NAME STATUS ROLES AGE VERSION
compute-0 Ready <none> 29m v1.16.0
compute-1 Ready <none> 29m v1.16.0
compute-2 Ready <none> 29m v1.16.0
controller-0 Ready master 74m v1.16.0
controller-1 Ready master 37m v1.16.0
controller-1:~$

[2019-10-10 07:56:48,499] 311 DEBUG MainThread ssh.send :: Send 'kubectl get pod -o=wide --all-namespaces'
[2019-10-10 07:56:48,719] 433 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system calico-kube-controllers-7f985db75c-7qlr4 0/1 Running 3 74m dead:beef::8e22:765f:6121:eb45 controller-0 <none> <none>

[2019-10-10 08:01:47,379] 311 DEBUG MainThread ssh.send :: Send 'kubectl get pod -o=wide --all-namespaces'
[2019-10-10 08:01:47,596] 433 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system calico-kube-controllers-7f985db75c-7qlr4 0/1 Running 3 79m dead:beef::8e22:765f:6121:eb45 controller-0 <none> <none>

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Peng, what is the impact of this? How long does it take for this pod to recover?

description: updated
description: updated
Changed in starlingx:
status: New → Incomplete
Revision history for this message
Yang Liu (yliu12) wrote :

This issue was uncovered only recently because we added an extra check in automation to wait for Running pods to become Ready after each test. So the Last Pass indicated in the description is not accurate.

Looking at previous automation logs, this issue was also seen with the 20190828T013000Z load.

I tried it manually on an IPv4 system; the pod eventually became ready about 15 minutes after swacting to the new active controller.

Another odd thing is that the pod sometimes moves to a worker node...
controller-1:~$
[2019-09-17 13:40:11,184] 311 DEBUG MainThread ssh.send :: Send 'kubectl get pod -o=wide --all-namespaces'
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default resource-consumer-755c664f77-5hmlr 1/1 Running 0 9m22s dead:beef::bc1b:6533:4fd4:e141 compute-2 <none> <none>
kube-system calico-kube-controllers-767467f9cf-8bbrf 1/1 Running 0 7m37s dead:beef::a2bf:c94c:345d:bc40 compute-0 <none> <none>

Revision history for this message
Yang Liu (yliu12) wrote :

Pod logs when this happens:

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning Unhealthy 21m (x2 over 22m) kubelet, controller-0 Readiness probe failed: Error reaching apiserver: Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded with http status code: 0; Error verifying datastore: Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
  Warning Unhealthy 20m (x2 over 23m) kubelet, controller-0 Readiness probe failed: Error reaching apiserver: taking a long time to check apiserver; Error verifying datastore: Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
  Warning Unhealthy 19m (x15 over 24m) kubelet, controller-0 Readiness probe failed: Error verifying datastore: Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded; Error reaching apiserver: taking a long time to check apiserver
  Warning Unhealthy 9m3s (x36 over 23m) kubelet, controller-0 Readiness probe failed: Error verifying datastore: Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded; Error reaching apiserver: Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded with http status code: 0

2019-10-11 19:47:56.516 [ERROR][1] main.go 178: Failed to verify datastore error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2019-10-11 19:48:28.516 [ERROR][1] main.go 209: Failed to reach apiserver error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2019-10-11 19:48:48.517 [ERROR][1] client.go 232: Error getting cluster information config ClusterInformation="default" error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2019-10-11 19:48:48.517 [ERROR][1] main.go 178: Failed to verify datastore error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2019-10-11 19:49:20.517 [ERROR][1] main.go 209: Failed to reach apiserver error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
E1011 19:49:35.818826 1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 172.16.192.68:47170->10.96.0.1:443: read: connection timed out
2019-10-11 19:49:35.818 [ERROR][1] client.go 232: Error getting cluster information config ClusterInformation="default" error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: read tcp 172.16.192.68:47170->10.96.0.1:443: read: connection timed out
2019-10-11 19:49:35.818 [ERROR][1] main.go 178: Failed to verify datastore error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: read tcp 172.16.192...

Revision history for this message
Yang Liu (yliu12) wrote :

Logs attached.
system host-swact was triggered from controller-0 to controller-1 at around 19:33.

Revision history for this message
Peng Peng (ppeng) wrote :

During that period, fm & system commands were all working and there were no extra alarms. It seems there is not much impact on the system.

Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per Brent, the impact is that no new pods can be launched until calico recovers.
Marking as stx.3.0 for now until further investigation.

Changed in starlingx:
assignee: nobody → Joseph Richard (josephrichard)
summary: - IPv6 regular, 5 mins after swact calico-kube-controllers pod not getting
- ready
+ 5 mins after swact calico-kube-controllers pod not getting ready
tags: added: stx.3.0 stx.containers stx.networking
Changed in starlingx:
importance: Undecided → Medium
status: Incomplete → Triaged
Revision history for this message
Joseph Richard (josephrichard) wrote :

This looks to be an issue with the TCP retransmit taking too long after the floating IP changes hosts.
We currently use the default of net.ipv4.tcp_retries2=15, which is around a 5-minute delay. When I set sysctl net.ipv4.tcp_retries2=3, I am no longer seeing this delay.
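
For reference, a minimal sketch of inspecting and temporarily applying this workaround at runtime (non-persistent; the permanent change lands in config-files via the fix below):

    # show the current retransmission retry count (kernel default is 15)
    sysctl net.ipv4.tcp_retries2

    # lower it at runtime only, matching the experiment described above
    sudo sysctl -w net.ipv4.tcp_retries2=3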

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config-files (master)

Fix proposed to branch: master
Review: https://review.opendev.org/691675

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config-files (master)

Reviewed: https://review.opendev.org/691675
Committed: https://git.openstack.org/cgit/starlingx/config-files/commit/?id=50b7827b7d0dd8f1ff03b369c81af98ade089973
Submitter: Zuul
Branch: master

commit 50b7827b7d0dd8f1ff03b369c81af98ade089973
Author: Joseph Richard <email address hidden>
Date: Fri Oct 25 14:19:41 2019 -0400

    Decrease tcp retransmission timeout to ~100 secs

    This commit decreases net.ipv4.tcp_retries2 to 8, which results in
    a timeout of ~100 seconds, down from the default of 15, which results
    in a timeout of ~15 minutes.
    This should result in dead connections being detected faster,
    in ~100 seconds, rather than ~15 minutes, which is important when the
    floating controller ip moves during a swact.

    Closes-bug: 1847607
    Change-Id: I48c3d396bdcd81ab11d307fd62de36deffc99f12
    Signed-off-by: Joseph Richard <email address hidden>
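
For rough intuition, a back-of-the-envelope estimate of those timeouts (assuming the kernel's minimum retransmission timeout of 200 ms, exponential backoff, and the documented 120 s cap per retry):

    tcp_retries2=8:  0.2 + 0.4 + 0.8 + 1.6 + 3.2 + 6.4 + 12.8 + 25.6 + 51.2 ≈ 102 s
    tcp_retries2=15: later retries hit the 120 s cap, for a total of roughly 925 s (~15 min)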

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Verified on
Lab: WCP_63_66
Load: 2019-10-29_20-00-00

tags: removed: stx.retestneeded