Activity log for bug #1887438

Date Who What changed Old value New value Message
2020-07-13 21:20:40 Andrew Vaillancourt bug added bug
2020-07-13 21:23:20 Andrew Vaillancourt summary "Controller-0 Not Ready after force rebooting active controller (Controller-1))" → "Controller-0 Not Ready after force rebooting active controller (Controller-1)"
2020-07-15 03:44:21 Andrew Vaillancourt description

Old value (the original description):

Brief Description
-----------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status.

Severity
--------
Major

Steps to Reproduce
------------------
Force reboot the active controller.

Expected Behavior
-----------------
Upon rebooting the active controller, the standby controller takes over in ready state, and the system pods, applications, and any test pods are up and running.

Actual Behavior
---------------
After force rebooting controller-1, controller-0 did not reach 'Ready' status. The following pods never reached healthy status:

cert-manager   cm-cert-manager-856678cfb7-mmbzn                    0/1   Pending       0   4h23m
cert-manager   cm-cert-manager-cainjector-85849bd97-7trcg          0/1   Pending       0   4h22m
cert-manager   cm-cert-manager-webhook-5745478cbc-8k2m7            0/1   Pending       0   4h22m
kube-system    coredns-78d9fd7cb9-7bdw9                            0/1   Pending       0   4h22m
kube-system    ic-nginx-ingress-default-backend-5ffcfd7744-zr4wj   0/1   Terminating   0   4h23m
kube-system    rbd-provisioner-77bfb6dbb-7pglp                     0/1   Pending       0   4h22m

Reproducibility
---------------
Unknown

System Configuration
--------------------
Standard System: 2 controllers and 3 computes
LAB: WCP_71_75

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"

Last Pass
---------
N/A

Timestamp/Logs
--------------
Events:
Type      Reason             Age         From                Message
----      ------             ----        ----                -------
Warning   FailedScheduling   <unknown>   default-scheduler   0/5 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 1 node(s) had taint {services: disabled}, that the pod didn't tolerate, 3 node(s) didn't match node selector.
Warning   FailedScheduling   <unknown>   default-scheduler   0/5 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 1 node(s) had taint {services: disabled}, that the pod didn't tolerate, 3 node(s) didn't match node selector.
Warning   FailedScheduling   <unknown>   default-scheduler   0/5 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 3 node(s) didn't match node selector.

Test Activity
-------------
System Test Automation Development

Workaround
----------
N/A

New value: the same description with the following sections updated:

Reproducibility
---------------
Reproduced on the same lab with two different builds.

Branch/Pull Time/Commit
-----------------------
First failure:
BUILD_ID="2020-07-13_00-00-00"
BUILD_DATE="2020-07-13 00:05:40 -0400"
Second failure:
BUILD_ID="2020-07-14_00-00-00"
BUILD_DATE="2020-07-14 00:05:40 -0400"

Last Pass
---------
On build BUILD_ID="2020-07-14_00-00-00", a force reboot of controller-0 saw the expected behaviour. Following this system recovery, controller-1 was force rebooted and controller-0 did not reach ready status; some system and app pods were stuck in terminating/pending state, reproducing the issue seen in the previous build.

Timestamp/Logs
--------------
Collect-all logs and describe output for the unhealthy pods:
https://files.starlingx.kube.cengn.ca/launchpad/1887438
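The FailedScheduling events above name two taints ({node.kubernetes.io/unreachable: } and {services: disabled}). As a triage sketch using only stock kubectl (generic commands, not taken from the attached logs), the taints actually present on each node can be listed with:

controller-0:~$ kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
controller-0:~$ kubectl describe node controller-0   # the Taints: and Conditions: sections show why pods cannot schedule there

The first command prints every node alongside the keys of its taints; the second shows the full taint entries (key, value, effect) for the node that stays NotReady.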
2020-07-15 20:14:20 Ghada Khalil tags (none) → stx.containers
2020-07-15 20:14:26 Ghada Khalil starlingx: status New → Incomplete
2020-07-15 20:31:03 Andrew Vaillancourt description

Old value: the description as updated at 2020-07-15 03:44:21 (above).

New value: the same description with the node status added to the Actual Behavior section:

controller-0:~$ kubectl get nodes
NAME           STATUS     ROLES    AGE   VERSION
compute-0      Ready      <none>   8h    v1.18.1
compute-1      Ready      <none>   8h    v1.18.1
compute-2      Ready      <none>   8h    v1.18.1
controller-0   NotReady   master   9h    v1.18.1
controller-1   Ready      master   8h    v1.18.1
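With controller-0 stuck NotReady, a minimal diagnostic sketch (standard kubectl; the actual output for this lab is in the collect logs linked above) is to check the node's reported conditions and list the pods that are not running:

controller-0:~$ kubectl describe node controller-0                            # Conditions: shows the reason/message kubelet last reported
controller-0:~$ kubectl get pods -A --field-selector=status.phase!=Running   # pods whose phase is not Running, e.g. the Pending ones above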
2020-07-20 23:27:34 Chris Friesen bug watch added https://github.com/kubernetes/kubernetes/issues/93268
2020-07-20 23:27:34 Chris Friesen bug watch added https://github.com/golang/go/issues/40213
2020-07-20 23:27:42 Chris Friesen starlingx: status Incomplete → Confirmed
2020-07-22 18:00:27 Ghada Khalil starlingx: importance Undecided → Low
2020-07-22 18:02:02 Ghada Khalil tags stx.containers → stx.5.0 stx.containers
2020-07-22 18:03:28 Ghada Khalil starlingx: importance Low → Medium
2020-07-22 18:03:38 Ghada Khalil starlingx: assignee (none) → Frank Miller (sensfan22)
2020-07-22 18:20:54 Andrew Vaillancourt description

Old value: the description as updated at 2020-07-15 20:31:03 (above).

New value: the same description with the Workaround section updated:

Workaround
----------
Possible workaround, from https://github.com/kubernetes/kubernetes/issues/93268: "... after all nodes were running again [...] restarting kubelet on the "NotReady" node was enough to make it go "Ready" again."
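A concrete rendering of that quoted workaround, assuming kubelet is managed by systemd on the affected node (a sketch based on the upstream issue, not a verified StarlingX procedure):

controller-0:~$ sudo systemctl restart kubelet
controller-0:~$ kubectl get nodes -w   # watch until controller-0 transitions NotReady -> Ready

Restarting kubelet forces it to open fresh connections to the apiserver, which lines up with the linked upstream reports (kubernetes/kubernetes#93268, golang/go#40213) of a broken connection that is never detected and closed.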
2021-03-26 21:36:58 Frank Miller starlingx: assignee Frank Miller (sensfan22) → Chris Friesen (cbf123)
2021-04-15 21:49:59 Frank Miller tags stx.5.0 stx.containers → stx.6.0 stx.containers
2021-06-02 18:12:55 OpenStack Infra tags stx.6.0 stx.containers → in-f-centos8 stx.6.0 stx.containers
2021-06-02 18:12:55 OpenStack Infra cve linked 2019-10160
2021-06-02 18:12:55 OpenStack Infra cve linked 2019-16056
2021-06-02 18:12:55 OpenStack Infra cve linked 2019-9636
2021-06-02 18:12:55 OpenStack Infra cve linked 2019-9924
2021-06-02 18:12:55 OpenStack Infra cve linked 2019-9948
2021-06-07 17:19:00 OpenStack Infra bug watch added https://bugzilla.redhat.com/show_bug.cgi?id=1793527
2021-06-07 17:19:00 OpenStack Infra bug watch added https://bugzilla.redhat.com/show_bug.cgi?id=1819868
2021-06-07 17:19:00 OpenStack Infra cve linked 2018-15473
2021-06-07 17:19:00 OpenStack Infra cve linked 2019-18634
2021-06-07 17:19:00 OpenStack Infra cve linked 2019-6470
2021-06-07 17:19:00 OpenStack Infra cve linked 2020-13817
2021-06-07 17:19:00 OpenStack Infra cve linked 2020-15705
2021-06-07 17:19:00 OpenStack Infra cve linked 2020-15707
2021-06-07 17:19:00 OpenStack Infra cve linked 2021-3156
2021-11-29 16:07:50 Chris Friesen starlingx: status Confirmed → Fix Released