2020-07-13 21:20:40 |
Andrew Vaillancourt |
bug |
|
|
added bug |
2020-07-13 21:23:20 |
Andrew Vaillancourt |
summary |
Controller-0 Not Ready after force rebooting active controller (Controller-1)) |
Controller-0 Not Ready after force rebooting active controller (Controller-1) |
|
2020-07-15 03:44:21 |
Andrew Vaillancourt |
description |
Brief Description
-----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
Severity
--------
Major
Steps to Reproduce
------------------
Force reboot active controller
Expected Behavior
------------------
Upon rebooting the active controller, the standby controller takes over in Ready state, and the system pods, applications, and any test pods come up and run.
Actual Behavior
----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
The following pods never reached healthy status:
cert-manager cm-cert-manager-856678cfb7-mmbzn 0/1 Pending 0 4h23m
cert-manager cm-cert-manager-cainjector-85849bd97-7trcg 0/1 Pending 0 4h22m
cert-manager cm-cert-manager-webhook-5745478cbc-8k2m7 0/1 Pending 0 4h22m
kube-system coredns-78d9fd7cb9-7bdw9 0/1 Pending 0 4h22m
kube-system ic-nginx-ingress-default-backend-5ffcfd7744-zr4wj 0/1 Terminating 0 4h23m
kube-system rbd-provisioner-77bfb6dbb-7pglp 0/1 Pending 0 4h22m
Reproducibility
---------------
Unknown
System Configuration
--------------------
Standard System
2 Controllers and 3 Computes
LAB: WCP_71_75
Branch/Pull Time/Commit
-----------------------
BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"
Last Pass
---------
N/A
Timestamp/Logs
--------------
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/5 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 1 node(s) had taint {services: disabled}, that the pod didn't tolerate, 3 node(s) didn't match node selector.
Warning FailedScheduling <unknown> default-scheduler 0/5 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 1 node(s) had taint {services: disabled}, that the pod didn't tolerate, 3 node(s) didn't match node selector.
Warning FailedScheduling <unknown> default-scheduler 0/5 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) didn't match node selector.
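The taints named in these scheduling failures can be inspected directly. A minimal triage sketch (commands assumed for illustration, not taken from the original log):
# List each node's taints to see which node carries
# node.kubernetes.io/unreachable or services=disabled.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
# Dump full taint and condition detail for the stuck node.
kubectl describe node controller-0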
Test Activity
-------------
System Test Automation Development
Workaround
----------
N/A |
Brief Description
-----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
Severity
--------
Major
Steps to Reproduce
------------------
Force reboot active controller
Expected Behavior
------------------
Upon rebooting the active controller, the standby controller takes over in Ready state, and the system pods, applications, and any test pods come up and run.
Actual Behavior
----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
The following pods never reached healthy status:
cert-manager cm-cert-manager-856678cfb7-mmbzn 0/1 Pending 0 4h23m
cert-manager cm-cert-manager-cainjector-85849bd97-7trcg 0/1 Pending 0 4h22m
cert-manager cm-cert-manager-webhook-5745478cbc-8k2m7 0/1 Pending 0 4h22m
kube-system coredns-78d9fd7cb9-7bdw9 0/1 Pending 0 4h22m
kube-system ic-nginx-ingress-default-backend-5ffcfd7744-zr4wj 0/1 Terminating 0 4h23m
kube-system rbd-provisioner-77bfb6dbb-7pglp 0/1 Pending 0 4h22m
Reproducibility
---------------
Reproduced on the same lab with two different builds.
System Configuration
--------------------
Standard System
2 Controllers and 3 Computes
LAB: WCP_71_75
Branch/Pull Time/Commit
-----------------------
first failure
BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"
second failure
BUILD_ID="2020-07-14_00-00-00"
BUILD_DATE="2020-07-14 00:05:40 -0400"
Last Pass
---------
On build BUILD_ID="2020-07-14_00-00-00", a force reboot of controller-0 showed the expected behaviour. Following this system recovery, controller-1 was force rebooted and controller-0 did not reach Ready status; some system and app pods were stuck in Terminating/Pending state, reproducing the issue seen in the previous build.
Timestamp/Logs
--------------
Collect-all logs and describe output for unhealthy pods: https://files.starlingx.kube.cengn.ca/launchpad/1887438
Test Activity
-------------
System Test Automation Development
Workaround
----------
N/A |
|
2020-07-15 20:14:20 |
Ghada Khalil |
tags |
|
stx.containers |
|
2020-07-15 20:14:26 |
Ghada Khalil |
starlingx: status |
New |
Incomplete |
|
2020-07-15 20:31:03 |
Andrew Vaillancourt |
description |
Brief Description
-----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
Severity
--------
Major
Steps to Reproduce
------------------
Force reboot active controller
Expected Behavior
------------------
Upon rebooting the active controller, the standby controller takes over in Ready state, and the system pods, applications, and any test pods come up and run.
Actual Behavior
----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
The following pods never reached healthy status:
cert-manager cm-cert-manager-856678cfb7-mmbzn 0/1 Pending 0 4h23m
cert-manager cm-cert-manager-cainjector-85849bd97-7trcg 0/1 Pending 0 4h22m
cert-manager cm-cert-manager-webhook-5745478cbc-8k2m7 0/1 Pending 0 4h22m
kube-system coredns-78d9fd7cb9-7bdw9 0/1 Pending 0 4h22m
kube-system ic-nginx-ingress-default-backend-5ffcfd7744-zr4wj 0/1 Terminating 0 4h23m
kube-system rbd-provisioner-77bfb6dbb-7pglp 0/1 Pending 0 4h22m
Reproducibility
---------------
Reproduced on the same lab with two different builds.
System Configuration
--------------------
Standard System
2 Controllers and 3 Computes
LAB: WCP_71_75
Branch/Pull Time/Commit
-----------------------
first failure
BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"
second failure
BUILD_ID="2020-07-14_00-00-00"
BUILD_DATE="2020-07-14 00:05:40 -0400"
Last Pass
---------
On build BUILD_ID="2020-07-14_00-00-00", a force reboot of controller-0 showed the expected behaviour. Following this system recovery, controller-1 was force rebooted and controller-0 did not reach Ready status; some system and app pods were stuck in Terminating/Pending state, reproducing the issue seen in the previous build.
Timestamp/Logs
--------------
Collect-all logs and describe output for unhealthy pods: https://files.starlingx.kube.cengn.ca/launchpad/1887438
Test Activity
-------------
System Test Automation Development
Workaround
----------
N/A |
Brief Description
-----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
Severity
--------
Major
Steps to Reproduce
------------------
Force reboot active controller
Expected Behavior
------------------
Upon rebooting the active controller, the standby controller takes over in Ready state, and the system pods, applications, and any test pods come up and run.
Actual Behavior
----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
controller-0:~$ kubectl get nodes
NAME           STATUS     ROLES    AGE   VERSION
compute-0      Ready      <none>   8h    v1.18.1
compute-1      Ready      <none>   8h    v1.18.1
compute-2      Ready      <none>   8h    v1.18.1
controller-0   NotReady   master   9h    v1.18.1
controller-1   Ready      master   8h    v1.18.1
The following pods never reached healthy status:
cert-manager cm-cert-manager-856678cfb7-mmbzn 0/1 Pending 0 4h23m
cert-manager cm-cert-manager-cainjector-85849bd97-7trcg 0/1 Pending 0 4h22m
cert-manager cm-cert-manager-webhook-5745478cbc-8k2m7 0/1 Pending 0 4h22m
kube-system coredns-78d9fd7cb9-7bdw9 0/1 Pending 0 4h22m
kube-system ic-nginx-ingress-default-backend-5ffcfd7744-zr4wj 0/1 Terminating 0 4h23m
kube-system rbd-provisioner-77bfb6dbb-7pglp 0/1 Pending 0 4h22m
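As a hedged follow-up (these commands are illustrative and not part of the original report), the NotReady node's conditions and kubelet heartbeat can be read directly:
# Print each condition type, status, and reason for controller-0.
kubectl get node controller-0 -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'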
Reproducibility
---------------
Reproduced on the same lab with two different builds.
System Configuration
--------------------
Standard System
2 Controllers and 3 Computes
LAB: WCP_71_75
Branch/Pull Time/Commit
-----------------------
first failure
BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"
second failure
BUILD_ID="2020-07-14_00-00-00"
BUILD_DATE="2020-07-14 00:05:40 -0400"
Last Pass
---------
On build BUILD_ID="2020-07-14_00-00-00", a force reboot of controller-0 showed the expected behaviour. Following this system recovery, controller-1 was force rebooted and controller-0 did not reach Ready status; some system and app pods were stuck in Terminating/Pending state, reproducing the issue seen in the previous build.
Timestamp/Logs
--------------
Collect-all logs and describe output for unhealthy pods: https://files.starlingx.kube.cengn.ca/launchpad/1887438
Test Activity
-------------
System Test Automation Development
Workaround
----------
N/A |
|
2020-07-20 23:27:34 |
Chris Friesen |
bug watch added |
|
https://github.com/kubernetes/kubernetes/issues/93268 |
|
2020-07-20 23:27:34 |
Chris Friesen |
bug watch added |
|
https://github.com/golang/go/issues/40213 |
|
2020-07-20 23:27:42 |
Chris Friesen |
starlingx: status |
Incomplete |
Confirmed |
|
2020-07-22 18:00:27 |
Ghada Khalil |
starlingx: importance |
Undecided |
Low |
|
2020-07-22 18:02:02 |
Ghada Khalil |
tags |
stx.containers |
stx.5.0 stx.containers |
|
2020-07-22 18:03:28 |
Ghada Khalil |
starlingx: importance |
Low |
Medium |
|
2020-07-22 18:03:38 |
Ghada Khalil |
starlingx: assignee |
|
Frank Miller (sensfan22) |
|
2020-07-22 18:20:54 |
Andrew Vaillancourt |
description |
Brief Description
-----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
Severity
--------
Major
Steps to Reproduce
------------------
Force reboot active controller
Expected Behavior
------------------
Upon rebooting the active controller, the standby controller takes over in Ready state, and the system pods, applications, and any test pods come up and run.
Actual Behavior
----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
controller-0:~$ kubectl get nodes
NAME           STATUS     ROLES    AGE   VERSION
compute-0      Ready      <none>   8h    v1.18.1
compute-1      Ready      <none>   8h    v1.18.1
compute-2      Ready      <none>   8h    v1.18.1
controller-0   NotReady   master   9h    v1.18.1
controller-1   Ready      master   8h    v1.18.1
The following pods never reached healthy status:
cert-manager cm-cert-manager-856678cfb7-mmbzn 0/1 Pending 0 4h23m
cert-manager cm-cert-manager-cainjector-85849bd97-7trcg 0/1 Pending 0 4h22m
cert-manager cm-cert-manager-webhook-5745478cbc-8k2m7 0/1 Pending 0 4h22m
kube-system coredns-78d9fd7cb9-7bdw9 0/1 Pending 0 4h22m
kube-system ic-nginx-ingress-default-backend-5ffcfd7744-zr4wj 0/1 Terminating 0 4h23m
kube-system rbd-provisioner-77bfb6dbb-7pglp 0/1 Pending 0 4h22m
Reproducibility
---------------
Reproduced on the same lab with two different builds.
System Configuration
--------------------
Standard System
2 Controllers and 3 Computes
LAB: WCP_71_75
Branch/Pull Time/Commit
-----------------------
first failure
BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"
second failure
BUILD_ID="2020-07-14_00-00-00"
BUILD_DATE="2020-07-14 00:05:40 -0400"
Last Pass
---------
On build BUILD_ID="2020-07-14_00-00-00", a force reboot of controller-0 showed the expected behaviour. Following this system recovery, controller-1 was force rebooted and controller-0 did not reach Ready status; some system and app pods were stuck in Terminating/Pending state, reproducing the issue seen in the previous build.
Timestamp/Logs
--------------
Collect-all logs and describe output for unhealthy pods: https://files.starlingx.kube.cengn.ca/launchpad/1887438
Test Activity
-------------
System Test Automation Development
Workaround
----------
N/A |
Brief Description
-----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
Severity
--------
Major
Steps to Reproduce
------------------
Force reboot active controller
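One hedged way to perform this step (the exact mechanism used in the test run is not recorded; both variants below are assumptions):
# Ungraceful reboot over ssh, skipping clean service shutdown:
ssh sysadmin@controller-1 'sudo reboot -f'
# Or cut power via the BMC, if IPMI is available (placeholders, not lab values):
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> power cycle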
Expected Behavior
------------------
Upon rebooting the active controller, the standby controller takes over in Ready state, and the system pods, applications, and any test pods come up and run.
Actual Behavior
----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
controller-0:~$ kubectl get nodes
NAME           STATUS     ROLES    AGE   VERSION
compute-0      Ready      <none>   8h    v1.18.1
compute-1      Ready      <none>   8h    v1.18.1
compute-2      Ready      <none>   8h    v1.18.1
controller-0   NotReady   master   9h    v1.18.1
controller-1   Ready      master   8h    v1.18.1
The following pods never reached healthy status:
cert-manager cm-cert-manager-856678cfb7-mmbzn 0/1 Pending 0 4h23m
cert-manager cm-cert-manager-cainjector-85849bd97-7trcg 0/1 Pending 0 4h22m
cert-manager cm-cert-manager-webhook-5745478cbc-8k2m7 0/1 Pending 0 4h22m
kube-system coredns-78d9fd7cb9-7bdw9 0/1 Pending 0 4h22m
kube-system ic-nginx-ingress-default-backend-5ffcfd7744-zr4wj 0/1 Terminating 0 4h23m
kube-system rbd-provisioner-77bfb6dbb-7pglp 0/1 Pending 0 4h22m
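A quick sketch for enumerating pods stuck outside the Running/Succeeded phases (command assumed, not part of the original report):
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded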
Reproducibility
---------------
Reproduced on the same lab with two different builds.
System Configuration
--------------------
Standard System
2 Controllers and 3 Computes
LAB: WCP_71_75
Branch/Pull Time/Commit
-----------------------
first failure
BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"
second failure
BUILD_ID="2020-07-14_00-00-00"
BUILD_DATE="2020-07-14 00:05:40 -0400"
Last Pass
---------
On build BUILD_ID="2020-07-14_00-00-00", a force reboot of controller-0 showed the expected behaviour. Following this system recovery, controller-1 was force rebooted and controller-0 did not reach Ready status; some system and app pods were stuck in Terminating/Pending state, reproducing the issue seen in the previous build.
Timestamp/Logs
--------------
Collect-all logs and describe output for unhealthy pods: https://files.starlingx.kube.cengn.ca/launchpad/1887438
Test Activity
-------------
System Test Automation Development
Workaround
----------
Possible workaround:
From https://github.com/kubernetes/kubernetes/issues/93268:
"... after all nodes were running again [...] restarting kubelet on the "NotReady" node was enough to make it go "Ready" again." |
|
2021-03-26 21:36:58 |
Frank Miller |
starlingx: assignee |
Frank Miller (sensfan22) |
Chris Friesen (cbf123) |
|
2021-04-15 21:49:59 |
Frank Miller |
tags |
stx.5.0 stx.containers |
stx.6.0 stx.containers |
|
2021-06-02 18:12:55 |
OpenStack Infra |
tags |
stx.6.0 stx.containers |
in-f-centos8 stx.6.0 stx.containers |
|
2021-06-02 18:12:55 |
OpenStack Infra |
cve linked |
|
2019-10160 |
|
2021-06-02 18:12:55 |
OpenStack Infra |
cve linked |
|
2019-16056 |
|
2021-06-02 18:12:55 |
OpenStack Infra |
cve linked |
|
2019-9636 |
|
2021-06-02 18:12:55 |
OpenStack Infra |
cve linked |
|
2019-9924 |
|
2021-06-02 18:12:55 |
OpenStack Infra |
cve linked |
|
2019-9948 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
bug watch added |
|
https://bugzilla.redhat.com/show_bug.cgi?id=1793527 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
bug watch added |
|
https://bugzilla.redhat.com/show_bug.cgi?id=1819868 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
cve linked |
|
2018-15473 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
cve linked |
|
2019-18634 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
cve linked |
|
2019-6470 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
cve linked |
|
2020-13817 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
cve linked |
|
2020-15705 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
cve linked |
|
2020-15707 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
cve linked |
|
2021-3156 |
|
2021-11-29 16:07:50 |
Chris Friesen |
starlingx: status |
Confirmed |
Fix Released |
|