OpenStack pods were not recovered after force reboot active controller

Bug #1855474 reported by Yosief Gebremariam on 2019-12-06
This bug affects 1 person
Affects: StarlingX
Importance: High
Assigned to: yong hu

Bug Description

Brief Description
-----------------
Many OpenStack pods failed to recover, or were slow to recover, after force-rebooting the active controller

Severity
--------
Major

Steps to Reproduce
------------------
- Install and configure system, apply stx-openstack application
- 'sudo reboot -f' from active controller

Expected Behavior
------------------
- system swacts to the standby controller and all OpenStack pods recover to Running or Completed states.
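
A quick way to verify this after the reboot (a sketch with assumed commands; run on the newly active controller):

source /etc/platform/openrc
fm alarm-list                                                         # platform alarms raised by the reboot/swact
kubectl get pods --all-namespaces | grep -v -e Completed -e Running   # anything listed here has not recovered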

Actual Behavior
----------------
- After force rebooting the active controller, a number of OpenStack pods were stuck in the Init state. The keystone-api and cinder-volume pods crashed (CrashLoopBackOff).

controller-0:~$ kubectl get pods --all-namespaces | grep -v -e Completed -e Running
NAMESPACE NAME READY STATUS RESTARTS AGE
openstack cinder-api-59fd9c7c6f-86h2d 0/1 Init:0/2 0 3h
openstack cinder-volume-654bcb6569-lsjxt 0/1 Init:CrashLoopBackOff 22 3h
openstack fm-rest-api-78f97cc864-fqkhj 0/1 Init:0/1 0 3h
openstack glance-api-54777c6d45-gxrdc 0/1 Init:0/3 0 3h
openstack heat-api-69b8487b88-g4tc2 0/1 Init:0/1 0 3h
openstack heat-cfn-6b4b6b74f8-w7f78 0/1 Init:0/1 0 3h
openstack heat-engine-8458cf778f-xbbd4 0/1 Init:0/1 0 3h
openstack heat-engine-cleaner-1575645900-pd697 0/1 Init:0/1 0 178m
openstack horizon-5545469f58-j4bf6 0/1 Init:0/1 0 175m
openstack keystone-api-6c45dc9dbb-2v8h5 0/1 CrashLoopBackOff 43 3h39m
openstack keystone-api-6c45dc9dbb-pch72 0/1 Init:0/1 0 3h
openstack neutron-server-79c6fdf585-lwpb7 0/1 Init:0/1 0 3h
openstack nova-api-metadata-855ccf8fc4-mk446 0/1 Init:0/2 0 3h
openstack nova-api-osapi-58b7ffbf-zjv8l 0/1 Init:0/1 0 3h
openstack nova-conductor-6bbf89bf4c-7bhvg 0/1 Init:0/1 0 3h
openstack nova-novncproxy-58779744bd-szx4m 0/1 Init:0/3 0 3h
openstack nova-scheduler-67c986b5c8-rgt8x 0/1 Init:0/1 0 3h
openstack nova-service-cleaner-1575648000-kdln5 0/1 Init:0/1 0 143m
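
For pods stuck in Init, the blocking init container can be identified with something like the following (a sketch only; the pod name is copied from the listing above, and the init container name "init" follows the usual openstack-helm convention, so it is an assumption here):

kubectl -n openstack describe pod cinder-api-59fd9c7c6f-86h2d | tail -n 30   # Events section shows what the pod is waiting for
kubectl -n openstack logs cinder-api-59fd9c7c6f-86h2d -c init                # dependency-wait output of the init container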

Reproducibility
---------------
Intermittent (reproduced in 2 out of 3 attempts)

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
r/stx.3.0 as of 2019-12-05 02:30:00

Timestamp/Logs
--------------
[2019-12-06 15:21:50,338] 181 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-12-06 15:21:50,338] 311 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

Yosief Gebremariam (ygebrema) wrote :
summary: - openstack pods were not recovered after force reboot active controller
+ OpenStack pods were not recovered after force reboot active controller
Ghada Khalil (gkhalil) wrote :

Assigning to the distro.openstack PL for triage and release recommendation -- keystone & cinder pods are not recovering.

description: updated
tags: added: stx.containers stx.distro.openstack
Changed in starlingx:
assignee: nobody → yong hu (yhu6)
description: updated
yong hu (yhu6) wrote :

@zhipeng, please analyze what went wrong after the active controller was rebooted and swacted.

Changed in starlingx:
assignee: yong hu (yhu6) → zhipeng liu (zhipengs)
yong hu (yhu6) on 2019-12-09
tags: added: stx.3.0
Changed in starlingx:
importance: Undecided → Medium
yong hu (yhu6) wrote :

I reproduced the issue in my environment: "mariadb-server" and "mariadb-ingress" were not running after rebooting the active controller and swacting to the other controller. So the mariadb services, being fundamental services, were blocking the other OpenStack services.

Debugging this issue to find the root cause...
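
A rough way to check the mariadb state while debugging (a sketch with assumed names; the credentials file path is taken from the openstack-helm mariadb chart and may differ per deployment):

kubectl -n openstack get pods -o wide | grep mariadb
# Galera cluster size as seen from a surviving server pod; a healthy two-node
# cluster should report wsrep_cluster_size = 2.
kubectl -n openstack exec mariadb-server-0 -- \
  mysql --defaults-file=/etc/mysql/admin_user.cnf \
  -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"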

Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_76_77
Load: 20191210T000000Z

[2019-12-10 11:14:27,854] 311 DEBUG MainThread ssh.send :: Send 'kubectl get pod --field-selector=status.phase=Running -o=wide --all-namespaces | grep --color=never -v 1/1'
[2019-12-10 11:14:28,286] 433 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack horizon-778555f47b-srlcm 0/1 Running 0 2m3s 172.16.192.78 controller-0 <none> <none>
openstack keystone-api-787744798f-xc7gl 0/1 Running 5 151m 172.16.166.141 controller-1 <none> <none>

[2019-12-10 12:28:00,655] 311 DEBUG MainThread ssh.send :: Send 'kubectl get pod --field-selector=status.phase=Running -o=wide --all-namespaces | grep --color=never -v 1/1'
[2019-12-10 12:28:01,167] 433 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack keystone-api-787744798f-xc7gl 0/1 CrashLoopBackOff 19 3h44m 172.16.166.141 controller-1 <none> <none>
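
A generic triage sketch for the crashing keystone-api pod (commands assumed, not taken from the original report); the previous-container logs usually show why it keeps exiting:

kubectl -n openstack logs keystone-api-787744798f-xc7gl --previous --tail=50
kubectl -n openstack describe pod keystone-api-787744798f-xc7gl | tail -n 20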

Elio Martinez (elio1979) wrote :

I'm trying to reproduce the issue with the following version and I'm not able to hit it:

controller-1:~$ cat /etc/build.info
###
### StarlingX
### Release 19.12
###

OS="centos"
SW_VERSION="19.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="r/stx.3.0"

JOB="STX_BUILD_3.0"
<email address hidden>"
BUILD_NUMBER="10"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-12-05 02:30:00 +0000"

Cristopher Lemus (cjlemusc) wrote :

Reproduced on Standard (2+2) Baremetal with load from Dec 10th (BUILD_DATE="2019-12-10 00:00:00 +0000"). Full log attached.

Pods did not recover after forcing a reboot of the active controller.

http://paste.openstack.org/show/787408/

yong hu (yhu6) on 2019-12-11
Changed in starlingx:
assignee: zhipeng liu (zhipengs) → yong hu (yhu6)
Ghada Khalil (gkhalil) wrote :

As per community review on 12/11, the recommendation is to cherry-pick this fix to r/stx.3.0 for the first maintenance release. Therefore, changing the priority to High (only high priority bugs are cherry-picked after the final compile).

Changed in starlingx:
importance: Medium → High
status: New → Triaged
yong hu (yhu6) wrote :

The root cause of this issue, i.e. why the mariadb-server cluster failed to recover after forcefully rebooting the active controller, has been identified.

Basically there are 2 mariadb-server instances in the mariadb cluster, and, as defined by the "ReadinessProbe", they periodically cross-check each other's status. In the failed case, whenever the StarlingX controller on which one mariadb-server runs is rebooted, the other mariadb-server (on the other controller) fails to sync with the destroyed (killed by the reboot) instance, and that "ReadinessProbe" failure in turn causes the surviving pod to fail as well. So it is essentially a deadlock, and eventually neither of the 2 mariadb-servers comes back to life.

This issue does not come from the openstack-helm/mariadb upstream, which actually runs 3 instances. With 3 mariadb-server instances, the failure/death of one instance won't crash the cluster because there are still 2 other live instances to cross-check with each other.
In StarlingX, we override the replica count from 3 to 2 so that the 2 mariadb servers can be placed on the 2 controllers respectively. However, this change (having only 2 instances) introduces the deadlock described above.
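
The two-replica override and the probe in question can be inspected on a running system with something like the following (illustrative only; the StatefulSet name "mariadb-server" follows the openstack-helm convention):

kubectl -n openstack get statefulset mariadb-server -o jsonpath='{.spec.replicas}'; echo
kubectl -n openstack get statefulset mariadb-server \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'; echo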

The solution is to disable the "ReadinessProbe", following the practice we took for Nova.
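
For illustration only (this is not the actual fix merged into the stx-openstack charts), a user-side override along the following lines could disable the probe on a running system. The values key shown is an assumption and may differ by chart version:

# Assumed values path for the mariadb chart's readiness probe toggle.
cat > mariadb-overrides.yaml <<'EOF'
pod:
  probes:
    server:
      mariadb:
        readiness:
          enabled: false
EOF
system helm-override-update --values mariadb-overrides.yaml stx-openstack mariadb openstack
system application-apply stx-openstack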

Changed in starlingx:
status: Triaged → In Progress
Peng Peng (ppeng) wrote :

Issue reproduced on
Lab: PV1
Load: 20191224T000000Z

[2019-12-24 09:46:15,497] 181 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-12-24 09:46:15,497] 311 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-12-24 09:59:52,052] 311 DEBUG MainThread ssh.send :: Send 'kubectl get pod --field-selector=status.phase!=Running,status.phase!=Succeeded --all-namespaces -o=wide'
[2019-12-24 09:59:52,328] 433 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack cinder-api-7db9fcf6b6-fcnv2 0/1 Init:0/2 0 12m 172.16.192.117 controller-0 <none> <none>
openstack cinder-volume-777f8744b-49hdl 0/1 Init:3/4 1 12m 172.16.192.105 controller-0 <none> <none>
openstack fm-rest-api-84b8579d67-vlp79 0/1 Init:0/1 0 12m 172.16.192.95 controller-0 <none> <none>
openstack glance-api-56677499-qcbr6 0/1 Init:0/3 0 12m 172.16.192.107 controller-0 <none> <none>
openstack heat-api-67d68b5478-s8znp 0/1 Init:0/1 0 12m 172.16.192.110 controller-0 <none> <none>
openstack heat-cfn-676dc96c56-ztgrw 0/1 Init:0/1 0 12m 172.16.192.75 controller-0 <none> <none>
openstack heat-engine-6695466669-mlsqh 0/1 Init:0/1 0 12m 172.16.192.109 controller-0 <none> <none>
openstack heat-engine-cleaner-1577181000-d2ls4 0/1 Init:0/1 0 9m47s 172.16.166.183 controller-1 <none> <none>
openstack horizon-5b8c4fb977-fbq5w 0/1 Init:0/1 0 7m47s 172.16.192.74 controller-0 <none> <none>
openstack keystone-api-7bd7cb98d8-7l7sl 0/1 Init:0/1 0 12m 172.16.192.111 controller-0 <none> <none>
openstack neutron-server-7fb5cb6fb5-q7lfq 0/1 Init:0/1 0 12m 172.16.192.99 controller-0 <none> <none>
openstack nova-api-metadata-f7797f95c-8vfln 0/1 Init:0/2 0 12m 172.16.192.89 controller-0 <none> <none>
openstack nova-api-osapi-5c4fb5b84c-x5cds 0/1 Init:0/1 0 12m 172.16.192.123 controller-0 <none> <none>
openstack nova-conductor-5847bc6cc9-dfjfm 0/1 Init:0/1 0 12m 172.16.192.119 controller-0 <none> <none>
openstack nova-novncproxy-5f665c7d9d-m22t5 0/1 Init:0/3 0 12m 172.16.192.126 controller-0 <none> <none>
openstack nova-scheduler-674d9f7d94-kv7s9 0/1 Init:0/1 0 12m 172.16.192.108 controller-0 <none> <none>
controller-1:~$
