2020-07-13 21:20:40 |
Andrew Vaillancourt |
bug |
|
|
added bug |
2020-07-13 21:23:20 |
Andrew Vaillancourt |
summary |
Controller-0 Not Ready after force rebooting active controller (Controller-1)) |
Controller-0 Not Ready after force rebooting active controller (Controller-1) |
|
2020-07-15 03:44:21 |
Andrew Vaillancourt |
description |
Brief Description
-----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
Severity
--------
Major
Steps to Reproduce
------------------
Force reboot active controller
Expected Behavior
------------------
Upon rebooting the active controller, the standby controller takes over in Ready state, and the system pods, applications, and any test pods come up and run.
Actual Behavior
----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
The following pods never reached healthy status:
cert-manager cm-cert-manager-856678cfb7-mmbzn 0/1 Pending 0 4h23m
cert-manager cm-cert-manager-cainjector-85849bd97-7trcg 0/1 Pending 0 4h22m
cert-manager cm-cert-manager-webhook-5745478cbc-8k2m7 0/1 Pending 0 4h22m
kube-system coredns-78d9fd7cb9-7bdw9 0/1 Pending 0 4h22m
kube-system ic-nginx-ingress-default-backend-5ffcfd7744-zr4wj 0/1 Terminating 0 4h23m
kube-system rbd-provisioner-77bfb6dbb-7pglp 0/1 Pending 0 4h22m
Reproducibility
---------------
Unknown
System Configuration
--------------------
Standard System
2 Controllers and 3 Computes
LAB: WCP_71_75
Branch/Pull Time/Commit
-----------------------
BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"
Last Pass
---------
N/A
Timestamp/Logs
--------------
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/5 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 1 node(s) had taint {services: disabled}, that the pod didn't tolerate, 3 node(s) didn't match node selector.
Warning FailedScheduling <unknown> default-scheduler 0/5 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 1 node(s) had taint {services: disabled}, that the pod didn't tolerate, 3 node(s) didn't match node selector.
Warning FailedScheduling <unknown> default-scheduler 0/5 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) didn't match node selector.
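The taints named in these scheduling failures can be inspected directly. A minimal triage sketch (commands assumed for illustration, not taken from the original log):
# List each node's taints to see which node carries
# node.kubernetes.io/unreachable or services=disabled.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
# Dump full taint and condition detail for the stuck node.
kubectl describe node controller-0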
Test Activity
-------------
System Test Automation Development
Workaround
----------
N/A |
Brief Description
-----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
Severity
--------
Major
Steps to Reproduce
------------------
Force reboot active controller
Expected Behavior
------------------
Upon rebooting the active controller, the standby controller takes over in Ready state, and the system pods, applications, and any test pods come up and run.
Actual Behavior
----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
The following pods never reached healthy status:
cert-manager cm-cert-manager-856678cfb7-mmbzn 0/1 Pending 0 4h23m
cert-manager cm-cert-manager-cainjector-85849bd97-7trcg 0/1 Pending 0 4h22m
cert-manager cm-cert-manager-webhook-5745478cbc-8k2m7 0/1 Pending 0 4h22m
kube-system coredns-78d9fd7cb9-7bdw9 0/1 Pending 0 4h22m
kube-system ic-nginx-ingress-default-backend-5ffcfd7744-zr4wj 0/1 Terminating 0 4h23m
kube-system rbd-provisioner-77bfb6dbb-7pglp 0/1 Pending 0 4h22m
Reproducibility
---------------
Reproduced on the same lab with two different builds.
System Configuration
--------------------
Standard System
2 Controllers and 3 Computes
LAB: WCP_71_75
Branch/Pull Time/Commit
-----------------------
first failure
BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"
second failure
BUILD_ID="2020-07-14_00-00-00"
BUILD_DATE="2020-07-14 00:05:40 -0400"
Last Pass
---------
On build BUILD_ID="2020-07-14_00-00-00", a force reboot of controller-0 showed the expected behaviour. Following this system recovery, controller-1 was force rebooted and controller-0 did not reach Ready status; some system and app pods were stuck in Terminating/Pending state, reproducing the issue seen in the previous build.
Timestamp/Logs
--------------
Collect-all logs and describe output for unhealthy pods: https://files.starlingx.kube.cengn.ca/launchpad/1887438
Test Activity
-------------
System Test Automation Development
Workaround
----------
N/A |
|
2020-07-15 20:14:20 |
Ghada Khalil |
tags |
|
stx.containers |
|
2020-07-15 20:14:26 |
Ghada Khalil |
starlingx: status |
New |
Incomplete |
|
2020-07-15 20:31:03 |
Andrew Vaillancourt |
description |
Brief Description
-----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
Severity
--------
Major
Steps to Reproduce
------------------
Force reboot active controller
Expected Behavior
------------------
Upon rebooting the active controller, the standby controller takes over in Ready state, and the system pods, applications, and any test pods come up and run.
Actual Behavior
----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
The following pods never reached healthy status:
cert-manager cm-cert-manager-856678cfb7-mmbzn 0/1 Pending 0 4h23m
cert-manager cm-cert-manager-cainjector-85849bd97-7trcg 0/1 Pending 0 4h22m
cert-manager cm-cert-manager-webhook-5745478cbc-8k2m7 0/1 Pending 0 4h22m
kube-system coredns-78d9fd7cb9-7bdw9 0/1 Pending 0 4h22m
kube-system ic-nginx-ingress-default-backend-5ffcfd7744-zr4wj 0/1 Terminating 0 4h23m
kube-system rbd-provisioner-77bfb6dbb-7pglp 0/1 Pending 0 4h22m
Reproducibility
---------------
Reproduced on the same lab with two different builds.
System Configuration
--------------------
Standard System
2 Controllers and 3 Computes
LAB: WCP_71_75
Branch/Pull Time/Commit
-----------------------
first failure
BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"
second failure
BUILD_ID="2020-07-14_00-00-00"
BUILD_DATE="2020-07-14 00:05:40 -0400"
Last Pass
---------
On build BUILD_ID="2020-07-14_00-00-00", a force reboot of controller-0 showed the expected behaviour. Following this system recovery, controller-1 was force rebooted and controller-0 did not reach Ready status; some system and app pods were stuck in Terminating/Pending state, reproducing the issue seen in the previous build.
Timestamp/Logs
--------------
Collect-all logs and describe output for unhealthy pods: https://files.starlingx.kube.cengn.ca/launchpad/1887438
Test Activity
-------------
System Test Automation Development
Workaround
----------
N/A |
Brief Description
-----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
Severity
--------
Major
Steps to Reproduce
------------------
Force reboot active controller
Expected Behavior
------------------
Upon rebooting the active controller, the standby controller takes over in Ready state, and the system pods, applications, and any test pods come up and run.
Actual Behavior
----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
controller-0:~$ kubectl get nodes
NAME           STATUS     ROLES    AGE   VERSION
compute-0      Ready      <none>   8h    v1.18.1
compute-1      Ready      <none>   8h    v1.18.1
compute-2      Ready      <none>   8h    v1.18.1
controller-0   NotReady   master   9h    v1.18.1
controller-1   Ready      master   8h    v1.18.1
The following pods never reached healthy status:
cert-manager cm-cert-manager-856678cfb7-mmbzn 0/1 Pending 0 4h23m
cert-manager cm-cert-manager-cainjector-85849bd97-7trcg 0/1 Pending 0 4h22m
cert-manager cm-cert-manager-webhook-5745478cbc-8k2m7 0/1 Pending 0 4h22m
kube-system coredns-78d9fd7cb9-7bdw9 0/1 Pending 0 4h22m
kube-system ic-nginx-ingress-default-backend-5ffcfd7744-zr4wj 0/1 Terminating 0 4h23m
kube-system rbd-provisioner-77bfb6dbb-7pglp 0/1 Pending 0 4h22m
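As a hedged follow-up (these commands are illustrative and not part of the original report), the NotReady node's conditions and kubelet heartbeat can be read directly:
# Print each condition type, status, and reason for controller-0.
kubectl get node controller-0 -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'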
Reproducibility
---------------
Reproduced on the same lab with two different builds.
System Configuration
--------------------
Standard System
2 Controllers and 3 Computes
LAB: WCP_71_75
Branch/Pull Time/Commit
-----------------------
first failure
BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"
second failure
BUILD_ID="2020-07-14_00-00-00"
BUILD_DATE="2020-07-14 00:05:40 -0400"
Last Pass
---------
On build BUILD_ID="2020-07-14_00-00-00", a force reboot of controller-0 showed the expected behaviour. Following this system recovery, controller-1 was force rebooted and controller-0 did not reach Ready status; some system and app pods were stuck in Terminating/Pending state, reproducing the issue seen in the previous build.
Timestamp/Logs
--------------
Collect-all logs and describe output for unhealthy pods: https://files.starlingx.kube.cengn.ca/launchpad/1887438
Test Activity
-------------
System Test Automation Development
Workaround
----------
N/A |
|
2020-07-20 23:27:34 |
Chris Friesen |
bug watch added |
|
https://github.com/kubernetes/kubernetes/issues/93268 |
|
2020-07-20 23:27:34 |
Chris Friesen |
bug watch added |
|
https://github.com/golang/go/issues/40213 |
|
2020-07-20 23:27:42 |
Chris Friesen |
starlingx: status |
Incomplete |
Confirmed |
|
2020-07-22 18:00:27 |
Ghada Khalil |
starlingx: importance |
Undecided |
Low |
|
2020-07-22 18:02:02 |
Ghada Khalil |
tags |
stx.containers |
stx.5.0 stx.containers |
|
2020-07-22 18:03:28 |
Ghada Khalil |
starlingx: importance |
Low |
Medium |
|
2020-07-22 18:03:38 |
Ghada Khalil |
starlingx: assignee |
|
Frank Miller (sensfan22) |
|
2020-07-22 18:20:54 |
Andrew Vaillancourt |
description |
Brief Description
-----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
Severity
--------
Major
Steps to Reproduce
------------------
Force reboot active controller
Expected Behavior
------------------
Upon rebooting the active controller, the standby controller takes over in Ready state, and the system pods, applications, and any test pods come up and run.
Actual Behavior
----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
controller-0:~$ kubectl get nodes
NAME           STATUS     ROLES    AGE   VERSION
compute-0      Ready      <none>   8h    v1.18.1
compute-1      Ready      <none>   8h    v1.18.1
compute-2      Ready      <none>   8h    v1.18.1
controller-0   NotReady   master   9h    v1.18.1
controller-1   Ready      master   8h    v1.18.1
The following pods never reached healthy status:
cert-manager cm-cert-manager-856678cfb7-mmbzn 0/1 Pending 0 4h23m
cert-manager cm-cert-manager-cainjector-85849bd97-7trcg 0/1 Pending 0 4h22m
cert-manager cm-cert-manager-webhook-5745478cbc-8k2m7 0/1 Pending 0 4h22m
kube-system coredns-78d9fd7cb9-7bdw9 0/1 Pending 0 4h22m
kube-system ic-nginx-ingress-default-backend-5ffcfd7744-zr4wj 0/1 Terminating 0 4h23m
kube-system rbd-provisioner-77bfb6dbb-7pglp 0/1 Pending 0 4h22m
Reproducibility
---------------
Reproduced on the same lab with two different builds.
System Configuration
--------------------
Standard System
2 Controllers and 3 Computes
LAB: WCP_71_75
Branch/Pull Time/Commit
-----------------------
first failure
BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"
second failure
BUILD_ID="2020-07-14_00-00-00"
BUILD_DATE="2020-07-14 00:05:40 -0400"
Last Pass
---------
On build BUILD_ID="2020-07-14_00-00-00", a force reboot of controller-0 showed the expected behaviour. Following this system recovery, controller-1 was force rebooted and controller-0 did not reach Ready status; some system and app pods were stuck in Terminating/Pending state, reproducing the issue seen in the previous build.
Timestamp/Logs
--------------
Collect-all logs and describe output for unhealthy pods: https://files.starlingx.kube.cengn.ca/launchpad/1887438
Test Activity
-------------
System Test Automation Development
Workaround
----------
N/A |
Brief Description
-----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
Severity
--------
Major
Steps to Reproduce
------------------
Force reboot active controller
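One hedged way to perform this step (the exact mechanism used in the test run is not recorded; both variants below are assumptions):
# Ungraceful reboot over ssh, skipping clean service shutdown:
ssh sysadmin@controller-1 'sudo reboot -f'
# Or cut power via the BMC, if IPMI is available (placeholders, not lab values):
ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> power cycle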
Expected Behavior
------------------
Upon rebooting the active controller, the standby controller takes over in Ready state, and the system pods, applications, and any test pods come up and run.
Actual Behavior
----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.
controller-0:~$ kubectl get nodes
NAME           STATUS     ROLES    AGE   VERSION
compute-0      Ready      <none>   8h    v1.18.1
compute-1      Ready      <none>   8h    v1.18.1
compute-2      Ready      <none>   8h    v1.18.1
controller-0   NotReady   master   9h    v1.18.1
controller-1   Ready      master   8h    v1.18.1
The following pods never reached healthy status:
cert-manager cm-cert-manager-856678cfb7-mmbzn 0/1 Pending 0 4h23m
cert-manager cm-cert-manager-cainjector-85849bd97-7trcg 0/1 Pending 0 4h22m
cert-manager cm-cert-manager-webhook-5745478cbc-8k2m7 0/1 Pending 0 4h22m
kube-system coredns-78d9fd7cb9-7bdw9 0/1 Pending 0 4h22m
kube-system ic-nginx-ingress-default-backend-5ffcfd7744-zr4wj 0/1 Terminating 0 4h23m
kube-system rbd-provisioner-77bfb6dbb-7pglp 0/1 Pending 0 4h22m
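A quick sketch for enumerating pods stuck outside the Running/Succeeded phases (command assumed, not part of the original report):
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded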
Reproducibility
---------------
Reproduced on the same lab with two different builds.
System Configuration
--------------------
Standard System
2 Controllers and 3 Computes
LAB: WCP_71_75
Branch/Pull Time/Commit
-----------------------
first failure
BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"
second failure
BUILD_ID="2020-07-14_00-00-00"
BUILD_DATE="2020-07-14 00:05:40 -0400"
Last Pass
---------
On build BUILD_ID="2020-07-14_00-00-00", a force reboot of controller-0 showed the expected behaviour. Following this system recovery, controller-1 was force rebooted and controller-0 did not reach Ready status; some system and app pods were stuck in Terminating/Pending state, reproducing the issue seen in the previous build.
Timestamp/Logs
--------------
Collect-all logs and describe output for unhealthy pods: https://files.starlingx.kube.cengn.ca/launchpad/1887438
Test Activity
-------------
System Test Automation Development
Workaround
----------
Possible workaround:
From https://github.com/kubernetes/kubernetes/issues/93268:
"... after all nodes were running again [...] restarting kubelet on the "NotReady" node was enough to make it go "Ready" again." |
|
2021-03-26 21:36:58 |
Frank Miller |
starlingx: assignee |
Frank Miller (sensfan22) |
Chris Friesen (cbf123) |
|
2021-04-15 21:49:59 |
Frank Miller |
tags |
stx.5.0 stx.containers |
stx.6.0 stx.containers |
|
2021-06-02 18:12:55 |
OpenStack Infra |
tags |
stx.6.0 stx.containers |
in-f-centos8 stx.6.0 stx.containers |
|
2021-06-02 18:12:55 |
OpenStack Infra |
cve linked |
|
2019-10160 |
|
2021-06-02 18:12:55 |
OpenStack Infra |
cve linked |
|
2019-16056 |
|
2021-06-02 18:12:55 |
OpenStack Infra |
cve linked |
|
2019-9636 |
|
2021-06-02 18:12:55 |
OpenStack Infra |
cve linked |
|
2019-9924 |
|
2021-06-02 18:12:55 |
OpenStack Infra |
cve linked |
|
2019-9948 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
bug watch added |
|
https://bugzilla.redhat.com/show_bug.cgi?id=1793527 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
bug watch added |
|
https://bugzilla.redhat.com/show_bug.cgi?id=1819868 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
cve linked |
|
2018-15473 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
cve linked |
|
2019-18634 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
cve linked |
|
2019-6470 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
cve linked |
|
2020-13817 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
cve linked |
|
2020-15705 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
cve linked |
|
2020-15707 |
|
2021-06-07 17:19:00 |
OpenStack Infra |
cve linked |
|
2021-3156 |
|
2021-11-29 16:07:50 |
Chris Friesen |
starlingx: status |
Confirmed |
Fix Released |
|