5 mins after swact calico-kube-controllers pod not getting ready

Bug #1847607 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Joseph Richard

Bug Description

Brief Description
-----------------
After a host swact, check pod status. After 5 minutes, the calico-kube-controllers pod is still not ready; 5 minutes seems too long.

Severity
--------
Major

Steps to Reproduce
------------------
As described in the Brief Description above.

TC-name: mtc/test_swact.py::test_swact_controller_platform

Expected Behavior
------------------
The pod becomes ready within a reasonable time after the swact.

Actual Behavior
----------------
The calico-kube-controllers pod is still not ready 5 minutes after the swact (see logs below).

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Multi-node system
IPv6

Lab-name: WCP_71-75

Branch/Pull Time/Commit
-----------------------
master as of
Load: 2019-10-09_20-00-00

Last Pass
---------
Unknown

Timestamp/Logs
--------------
[2019-10-10 07:55:15,693] 311 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

[2019-10-10 07:56:40,684] 311 DEBUG MainThread ssh.send :: Send 'kubectl get nodes'
[2019-10-10 07:56:40,954] 433 DEBUG MainThread ssh.expect :: Output:
NAME STATUS ROLES AGE VERSION
compute-0 Ready <none> 29m v1.16.0
compute-1 Ready <none> 29m v1.16.0
compute-2 Ready <none> 29m v1.16.0
controller-0 Ready master 74m v1.16.0
controller-1 Ready master 37m v1.16.0
controller-1:~$

[2019-10-10 07:56:48,499] 311 DEBUG MainThread ssh.send :: Send 'kubectl get pod -o=wide --all-namespaces'
[2019-10-10 07:56:48,719] 433 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system calico-kube-controllers-7f985db75c-7qlr4 0/1 Running 3 74m dead:beef::8e22:765f:6121:eb45 controller-0 <none> <none>

[2019-10-10 08:01:47,379] 311 DEBUG MainThread ssh.send :: Send 'kubectl get pod -o=wide --all-namespaces'
[2019-10-10 08:01:47,596] 433 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system calico-kube-controllers-7f985db75c-7qlr4 0/1 Running 3 79m dead:beef::8e22:765f:6121:eb45 controller-0 <none> <none>

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Peng, what is the impact of this? How long does it take for this pod to recover?

description: updated
description: updated
Changed in starlingx:
status: New → Incomplete
Revision history for this message
Yang Liu (yliu12) wrote :

This issue was uncovered only recently because we added an extra check in automation to wait for Running pods to become Ready after each test. So the Last Pass indicated in the description is not accurate.

Looking at previous automation logs, this issue was also seen with the 20190828T013000Z load.

I tried it manually on an IPv4 system; the pod eventually became ready about 15 minutes after swacting to the new active controller.

Another odd thing is that the pod sometimes moves to a worker node...
controller-1:~$
[2019-09-17 13:40:11,184] 311 DEBUG MainThread ssh.send :: Send 'kubectl get pod -o=wide --all-namespaces'
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default resource-consumer-755c664f77-5hmlr 1/1 Running 0 9m22s dead:beef::bc1b:6533:4fd4:e141 compute-2 <none> <none>
kube-system calico-kube-controllers-767467f9cf-8bbrf 1/1 Running 0 7m37s dead:beef::a2bf:c94c:345d:bc40 compute-0 <none> <none>

Revision history for this message
Yang Liu (yliu12) wrote :

Pod logs when this happens:

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning Unhealthy 21m (x2 over 22m) kubelet, controller-0 Readiness probe failed: Error reaching apiserver: Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded with http status code: 0; Error verifying datastore: Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
  Warning Unhealthy 20m (x2 over 23m) kubelet, controller-0 Readiness probe failed: Error reaching apiserver: taking a long time to check apiserver; Error verifying datastore: Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
  Warning Unhealthy 19m (x15 over 24m) kubelet, controller-0 Readiness probe failed: Error verifying datastore: Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded; Error reaching apiserver: taking a long time to check apiserver
  Warning Unhealthy 9m3s (x36 over 23m) kubelet, controller-0 Readiness probe failed: Error verifying datastore: Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded; Error reaching apiserver: Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded with http status code: 0

2019-10-11 19:47:56.516 [ERROR][1] main.go 178: Failed to verify datastore error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2019-10-11 19:48:28.516 [ERROR][1] main.go 209: Failed to reach apiserver error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2019-10-11 19:48:48.517 [ERROR][1] client.go 232: Error getting cluster information config ClusterInformation="default" error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2019-10-11 19:48:48.517 [ERROR][1] main.go 178: Failed to verify datastore error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2019-10-11 19:49:20.517 [ERROR][1] main.go 209: Failed to reach apiserver error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
E1011 19:49:35.818826 1 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 172.16.192.68:47170->10.96.0.1:443: read: connection timed out
2019-10-11 19:49:35.818 [ERROR][1] client.go 232: Error getting cluster information config ClusterInformation="default" error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: read tcp 172.16.192.68:47170->10.96.0.1:443: read: connection timed out
2019-10-11 19:49:35.818 [ERROR][1] main.go 178: Failed to verify datastore error=Get https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: read tcp 172.16.192...

Revision history for this message
Yang Liu (yliu12) wrote :

Logs attached.
system host-swact was triggered from controller-0 to controller-1 at around 19:33.

Revision history for this message
Peng Peng (ppeng) wrote :

During that period, fm & system commands were all working and there were no extra alarms. It seems there is not much impact on the system.

Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per Brent, the impact is that no new pods can be launched until calico recovers.
Marking as stx.3.0 for now until further investigation.

Changed in starlingx:
assignee: nobody → Joseph Richard (josephrichard)
summary: - IPv6 regular, 5 mins after swact calico-kube-controllers pod not getting
- ready
+ 5 mins after swact calico-kube-controllers pod not getting ready
tags: added: stx.3.0 stx.containers stx.networking
Changed in starlingx:
importance: Undecided → Medium
status: Incomplete → Triaged
Revision history for this message
Joseph Richard (josephrichard) wrote :

This looks to be an issue with the TCP retransmit taking too long after the floating IP changes hosts.
We currently use the default of net.ipv4.tcp_retries2=15, which is around a 5-minute delay. When I set sysctl net.ipv4.tcp_retries2=3, I am no longer seeing this delay.
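
For reference, a minimal sketch of inspecting and temporarily applying this workaround at runtime (non-persistent; the permanent change lands in config-files via the fix below):

    # show the current retransmission retry count (kernel default is 15)
    sysctl net.ipv4.tcp_retries2

    # lower it at runtime only, matching the experiment described above
    sudo sysctl -w net.ipv4.tcp_retries2=3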

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config-files (master)

Fix proposed to branch: master
Review: https://review.opendev.org/691675

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config-files (master)

Reviewed: https://review.opendev.org/691675
Committed: https://git.openstack.org/cgit/starlingx/config-files/commit/?id=50b7827b7d0dd8f1ff03b369c81af98ade089973
Submitter: Zuul
Branch: master

commit 50b7827b7d0dd8f1ff03b369c81af98ade089973
Author: Joseph Richard <email address hidden>
Date: Fri Oct 25 14:19:41 2019 -0400

    Decrease tcp retransmission timeout to ~100 secs

    This commit decreases net.ipv4.tcp_retries2 to 8, which results in
    a timeout of ~100 seconds, down from the default of 15, which results
    in a timeout of ~15 minutes.
    This should result in dead connections being detected faster,
    in ~100 seconds, rather than ~15 minutes, which is important when the
    floating controller ip moves during a swact.

    Closes-bug: 1847607
    Change-Id: I48c3d396bdcd81ab11d307fd62de36deffc99f12
    Signed-off-by: Joseph Richard <email address hidden>
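
For rough intuition, a back-of-the-envelope estimate of those timeouts (assuming the kernel's minimum retransmission timeout of 200 ms, exponential backoff, and the documented 120 s cap per retry):

    tcp_retries2=8:  0.2 + 0.4 + 0.8 + 1.6 + 3.2 + 6.4 + 12.8 + 25.6 + 51.2 ≈ 102 s
    tcp_retries2=15: later retries hit the 120 s cap, for a total of roughly 925 s (~15 min)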

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Verified on
Lab: WCP_63_66
Load: 2019-10-29_20-00-00

tags: removed: stx.retestneeded