OpenStack pods were not recovered after force reboot active controller

Bug #1855474 reported by Yosief Gebremariam on 2019-12-06
This bug affects 2 people
Affects: StarlingX
Importance: High
Assigned to: yong hu

Bug Description

Brief Description
-----------------
Many OpenStack pods failed to recover, or were slow to recover, after a forced reboot of the active controller

Severity
--------
Major

Steps to Reproduce
------------------
- Install and configure system, apply stx-openstack application
- 'sudo reboot -f' from active controller

Expected Behavior
------------------
- system swacts to the standby controller and all OpenStack pods recover to Running or Completed states.

Actual Behavior
----------------
- After force rebooting the controller, a number of OpenStack pods were stuck in Init state. The keystone-api and cinder-volume pods crashed.

controller-0:~$ kubectl get pods --all-namespaces | grep -v -e Completed -e Running
NAMESPACE NAME READY STATUS RESTARTS AGE
openstack cinder-api-59fd9c7c6f-86h2d 0/1 Init:0/2 0 3h
openstack cinder-volume-654bcb6569-lsjxt 0/1 Init:CrashLoopBackOff 22 3h
openstack fm-rest-api-78f97cc864-fqkhj 0/1 Init:0/1 0 3h
openstack glance-api-54777c6d45-gxrdc 0/1 Init:0/3 0 3h
openstack heat-api-69b8487b88-g4tc2 0/1 Init:0/1 0 3h
openstack heat-cfn-6b4b6b74f8-w7f78 0/1 Init:0/1 0 3h
openstack heat-engine-8458cf778f-xbbd4 0/1 Init:0/1 0 3h
openstack heat-engine-cleaner-1575645900-pd697 0/1 Init:0/1 0 178m
openstack horizon-5545469f58-j4bf6 0/1 Init:0/1 0 175m
openstack keystone-api-6c45dc9dbb-2v8h5 0/1 CrashLoopBackOff 43 3h39m
openstack keystone-api-6c45dc9dbb-pch72 0/1 Init:0/1 0 3h
openstack neutron-server-79c6fdf585-lwpb7 0/1 Init:0/1 0 3h
openstack nova-api-metadata-855ccf8fc4-mk446 0/1 Init:0/2 0 3h
openstack nova-api-osapi-58b7ffbf-zjv8l 0/1 Init:0/1 0 3h
openstack nova-conductor-6bbf89bf4c-7bhvg 0/1 Init:0/1 0 3h
openstack nova-novncproxy-58779744bd-szx4m 0/1 Init:0/3 0 3h
openstack nova-scheduler-67c986b5c8-rgt8x 0/1 Init:0/1 0 3h
openstack nova-service-cleaner-1575648000-kdln5 0/1 Init:0/1 0 143m
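
For triage, a listing like the one above can be split into pods that are still waiting on init containers and pods that are crash-looping. The following is a minimal sketch that parses a few rows copied from the listing; the function name `classify` and the three group names are illustrative and not part of any StarlingX tooling.

```python
# Sample rows copied from the listing above; columns are whitespace-separated.
SAMPLE = """\
NAMESPACE NAME READY STATUS RESTARTS AGE
openstack cinder-api-59fd9c7c6f-86h2d 0/1 Init:0/2 0 3h
openstack cinder-volume-654bcb6569-lsjxt 0/1 Init:CrashLoopBackOff 22 3h
openstack keystone-api-6c45dc9dbb-2v8h5 0/1 CrashLoopBackOff 43 3h39m
openstack keystone-api-6c45dc9dbb-pch72 0/1 Init:0/1 0 3h
"""


def classify(listing):
    """Group pod names by a coarse failure class taken from the STATUS column."""
    groups = {"waiting_on_init": [], "crash_looping": [], "other": []}
    for line in listing.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # ignore blank or truncated rows
        name, status = fields[1], fields[3]
        if "CrashLoopBackOff" in status:
            groups["crash_looping"].append(name)
        elif status.startswith("Init:"):
            groups["waiting_on_init"].append(name)
        else:
            groups["other"].append(name)
    return groups


groups = classify(SAMPLE)
for kind, names in groups.items():
    print(kind, len(names), names)
```

Pods in the init-waiting group are typically blocked on a dependency, so the crash-looping pods are usually the better place to start looking.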

Reproducibility
---------------
Intermittent (2 out of 3)

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
r/stx.3.0 as of 2019-12-05 02:30:00

Timestamp/Logs
--------------
[2019-12-06 15:21:50,338] 181 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-12-06 15:21:50,338] 311 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

Yosief Gebremariam (ygebrema) wrote :
summary: - openstack pods were not recovered after force reboot active controller
+ OpenStack pods were not recovered after force reboot active controller
Ghada Khalil (gkhalil) wrote :

Assigning to the distro.openstack PL for triage and release recommendation -- keystone & cinder pods are not recovering.

description: updated
tags: added: stx.containers stx.distro.openstack
Changed in starlingx:
assignee: nobody → yong hu (yhu6)
description: updated
yong hu (yhu6) wrote :

@zhipeng, please analyze what went wrong after the active controller was rebooted and swacted.

Changed in starlingx:
assignee: yong hu (yhu6) → zhipeng liu (zhipengs)
yong hu (yhu6) on 2019-12-09
tags: added: stx.3.0
Changed in starlingx:
importance: Undecided → Medium
yong hu (yhu6) wrote :

I reproduced the issue in my environment: "mariadb-server" and "mariadb-ingress" were not running after the active controller was rebooted and the system switched to the other controller. Because the mariadb services are fundamental, their failure was blocking the other OpenStack services.

Debugging this issue for the cause...

Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_76_77
Load: 20191210T000000Z

[2019-12-10 11:14:27,854] 311 DEBUG MainThread ssh.send :: Send 'kubectl get pod --field-selector=status.phase=Running -o=wide --all-namespaces | grep --color=never -v 1/1'
[2019-12-10 11:14:28,286] 433 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack horizon-778555f47b-srlcm 0/1 Running 0 2m3s 172.16.192.78 controller-0 <none> <none>
openstack keystone-api-787744798f-xc7gl 0/1 Running 5 151m 172.16.166.141 controller-1 <none> <none>

[2019-12-10 12:28:00,655] 311 DEBUG MainThread ssh.send :: Send 'kubectl get pod --field-selector=status.phase=Running -o=wide --all-namespaces | grep --color=never -v 1/1'
[2019-12-10 12:28:01,167] 433 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack keystone-api-787744798f-xc7gl 0/1 CrashLoopBackOff 19 3h44m 172.16.166.141 controller-1 <none> <none>

Elio Martinez (elio1979) wrote :

I'm trying to reproduce the issue with the following version, and I'm not able to trigger it:

controller-1:~$ cat /etc/build.info
###
### StarlingX
### Release 19.12
###

OS="centos"
SW_VERSION="19.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="r/stx.3.0"

JOB="STX_BUILD_3.0"
<email address hidden>"
BUILD_NUMBER="10"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-12-05 02:30:00 +0000"

Cristopher Lemus (cjlemusc) wrote :

Reproduced on Standard (2+2) Baremetal with load from Dec 10th (BUILD_DATE="2019-12-10 00:00:00 +0000"). Full log attached.

Pods did not recover after forcing a reboot of the active controller.

http://paste.openstack.org/show/787408/

yong hu (yhu6) on 2019-12-11
Changed in starlingx:
assignee: zhipeng liu (zhipengs) → yong hu (yhu6)
Ghada Khalil (gkhalil) wrote :

As per community review on 12/11, the recommendation is to cherry-pick this fix to r/stx.3.0 for the first maintenance release. Therefore, changing the priority to High (only high priority bugs are cherrypicked after the final compile).

Changed in starlingx:
importance: Medium → High
status: New → Triaged
yong hu (yhu6) wrote :

The root cause has been found: it explains why the mariadb-server cluster failed to recover after a forced reboot of the active controller.

The mariadb cluster has 2 mariadb-server instances, and their "ReadinessProbe" makes them periodically cross-check status with each other. In the failing case, when the StarlingX controller hosting one mariadb-server is rebooted, the mariadb-server on the other controller fails to sync with the instance destroyed by the reboot, and the resulting "ReadinessProbe" failure in turn causes its own pod to fail. It is essentially a deadlock: eventually neither of the 2 mariadb-servers comes back to life.

This issue does not come from upstream openstack-helm/mariadb, which actually has 3 instances. With 3 mariadb-server instances, the failure of one instance won't crash the cluster, because the other 2 live instances can still "cross-check" with each other.
In StarlingX, we override the replica count from 3 to 2 so that the 2 mariadb-servers can be placed on the 2 controllers respectively. However, having only 2 instances introduces the deadlock described above.

The solution is to disable the "ReadinessProbe", following the practice we took for Nova.
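
The 2-versus-3 replica argument above can be illustrated with a toy model. This is only a sketch under a simplifying assumption taken from the description: a replica passes its readiness check only if it is alive and at least one other replica is alive to cross-check with. The function names and the `peer_check` flag are hypothetical, and the model does not reproduce the real probe script.

```python
def ready_replicas(alive, peer_check=True):
    """Return indices of replicas that pass the (toy) readiness rule.

    alive: list of booleans, one per replica.
    peer_check: when True, a replica also needs at least one other live
    replica to cross-check with; when False, being alive is enough.
    """
    ready = []
    for i, up in enumerate(alive):
        has_live_peer = any(other for j, other in enumerate(alive) if j != i)
        if up and (has_live_peer or not peer_check):
            ready.append(i)
    return ready


def survives_single_loss(replica_count, peer_check=True):
    """True if some replica is still ready after one replica is lost."""
    alive = [True] * replica_count
    alive[0] = False  # the replica on the rebooted controller
    return len(ready_replicas(alive, peer_check)) > 0


for n in (2, 3):
    print(n, "replicas, peer check on :", survives_single_loss(n))
print(2, "replicas, peer check off:", survives_single_loss(2, peer_check=False))
```

In this model 2 replicas with the peer check leave no ready member after a single loss, 3 replicas leave two, and relaxing the peer check lets the surviving replica of a 2-member cluster stay ready.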

Changed in starlingx:
status: Triaged → In Progress
Peng Peng (ppeng) wrote :

Issue reproduced on
Lab: PV1
Load: 20191224T000000Z

[2019-12-24 09:46:15,497] 181 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-12-24 09:46:15,497] 311 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

[2019-12-24 09:59:52,052] 311 DEBUG MainThread ssh.send :: Send 'kubectl get pod --field-selector=status.phase!=Running,status.phase!=Succeeded --all-namespaces -o=wide'
[2019-12-24 09:59:52,328] 433 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack cinder-api-7db9fcf6b6-fcnv2 0/1 Init:0/2 0 12m 172.16.192.117 controller-0 <none> <none>
openstack cinder-volume-777f8744b-49hdl 0/1 Init:3/4 1 12m 172.16.192.105 controller-0 <none> <none>
openstack fm-rest-api-84b8579d67-vlp79 0/1 Init:0/1 0 12m 172.16.192.95 controller-0 <none> <none>
openstack glance-api-56677499-qcbr6 0/1 Init:0/3 0 12m 172.16.192.107 controller-0 <none> <none>
openstack heat-api-67d68b5478-s8znp 0/1 Init:0/1 0 12m 172.16.192.110 controller-0 <none> <none>
openstack heat-cfn-676dc96c56-ztgrw 0/1 Init:0/1 0 12m 172.16.192.75 controller-0 <none> <none>
openstack heat-engine-6695466669-mlsqh 0/1 Init:0/1 0 12m 172.16.192.109 controller-0 <none> <none>
openstack heat-engine-cleaner-1577181000-d2ls4 0/1 Init:0/1 0 9m47s 172.16.166.183 controller-1 <none> <none>
openstack horizon-5b8c4fb977-fbq5w 0/1 Init:0/1 0 7m47s 172.16.192.74 controller-0 <none> <none>
openstack keystone-api-7bd7cb98d8-7l7sl 0/1 Init:0/1 0 12m 172.16.192.111 controller-0 <none> <none>
openstack neutron-server-7fb5cb6fb5-q7lfq 0/1 Init:0/1 0 12m 172.16.192.99 controller-0 <none> <none>
openstack nova-api-metadata-f7797f95c-8vfln 0/1 Init:0/2 0 12m 172.16.192.89 controller-0 <none> <none>
openstack nova-api-osapi-5c4fb5b84c-x5cds 0/1 Init:0/1 0 12m 172.16.192.123 controller-0 <none> <none>
openstack nova-conductor-5847bc6cc9-dfjfm 0/1 Init:0/1 0 12m 172.16.192.119 controller-0 <none> <none>
openstack nova-novncproxy-5f665c7d9d-m22t5 0/1 Init:0/3 0 12m 172.16.192.126 controller-0 <none> <none>
openstack nova-scheduler-674d9f7d94-kv7s9 0/1 Init:0/1 0 12m 172.16.192.108 controller-0 <none> <none>
controller-1:~$

Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: PV1
Load: 20200127T000002Z

Log @
https://files.starlingx.kube.cengn.ca/launchpad/1855474

zhipeng liu (zhipengs) wrote :

Hi pengpeng,

The root cause has already been found and the fix is ready.
We need to get the related patches merged before retesting.

Thanks!
Zhipeng

Ghada Khalil (gkhalil) wrote :

Hi Zhipeng,
This issue is seen regularly. Can you please work on addressing the comments in the review from Feb 11 so that this can be merged? Thanks.

Change abandoned by yong hu (<email address hidden>) on branch: master
Review: https://review.opendev.org/699537
Reason: drop this one and will make the changes in the manifest, according to Angie's comment.

Ghada Khalil (gkhalil) wrote :

Also adding the stx.4.0 tag as this is regularly seen with stx master

tags: added: stx.4.0
Yatindra (yatindra) wrote :

Hi,
I think I am also affected by this issue. I use the stx 3.0 release; some time ago I rebooted from the active controller, and since then applying stx-openstack fails.

This is the output of $ kubectl describe pod mariadb-server-0 -n openstack
--
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning Unhealthy 2m7s (x8251 over 2d20h) kubelet, controller-1 Readiness probe failed:

Is it possible to get a patch for release stx 3.0, or will the fix only be available in stx 4.0?
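
The kubelet event quoted in the comment above (x8251 over 2d20h) gives a rough idea of how often the readiness probe has been failing. The sketch below does the arithmetic; the helper `age_to_seconds` is illustrative, and because kubectl rounds the displayed age the result is only approximate. The configured probe period is not stated in this report.

```python
import re

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_to_seconds(age):
    """Convert a kubectl-style age such as '2d20h' or '2m7s' to seconds."""
    parts = re.findall(r"(\d+)([dhms])", age)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


failures = 8251                    # count from the event line
window = age_to_seconds("2d20h")   # time span from the event line
interval = window / failures

print(f"{failures} failures over {window} s, about {interval:.1f} s apart")
```

That works out to roughly one failed readiness check every 30 seconds, sustained for almost three days, which fits a pod that never became ready rather than one that failed briefly.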

Yatindra (yatindra) wrote :

Sysinv logs for more info

yong hu (yhu6) wrote :

@zhipeng has a patch under review.
@Yatindra, to double-check, please look at the status of the mariadb pods; as we figured out, they should be the root cause.

Yatindra (yatindra) wrote :

@yong

I shared the output for the mariadb-server-0 pod above; it shows readiness probe failures. That seems to be the root cause: as the log shows, mariadb-server-0 is not ready.

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning Unhealthy 2m7s (x8251 over 2d20h) kubelet, controller-1 Readiness probe failed:

yong hu (yhu6) wrote :

Yes, it was the root cause.
See the commit message in this patch: https://review.opendev.org/#/c/699532/

Reviewed: https://review.opendev.org/699532
Committed: https://git.openstack.org/cgit/starlingx/openstack-armada-app/commit/?id=6538342c26c479988fbf8515260ee76e2809bc08
Submitter: Zuul
Branch: master

commit 6538342c26c479988fbf8515260ee76e2809bc08
Author: Hu, Yong <email address hidden>
Date: Wed Dec 18 02:23:44 2019 +0000

    Update mariadb chart to enable probe overrides

    Adding probes parameters for armada overriding them in duplex AIO and
    multi-node deployment. Specifically, there are 2 mariadb-servers in
    the DB cluster for OpenStack services at duplex or multi-node cases.
    These 2 mariadb-server pods are placed on Controller-0 and Controller-1
    respectively (manipulated by anti-affinity). Whenever one Controller is
    rebooted on purpose or even worse accidiently shutdown for any reasons
    mariadb-server pod on that controller is gone together. To keep mariadb
    cluster still working even with only one instance, we have to adjust
    the default probe behaviors. Upon this request, we have to export probe
    parameters for "startupProbe" and "readinessProbe" so that StarlingX
    Armada application could set these parameters accordingly and thereby
    mariadb server can still work as expected with even only one pod in the
    cases of Controller node rebooting or shutdown.

    Closes-bug: 1855474

    Change-Id: I3a8a99edd44d7ac4257ddf79b6baba5c52714324
    Signed-off-by: Hu, Yong <email address hidden>
    Co-Authored-By: Zhipeng, Liu <email address hidden>
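
The commit above exposes probe parameters so that an application-level override can replace the chart defaults. The general mechanism, layering override values on top of defaults, can be sketched as a recursive merge. Every key name and number below is hypothetical and chosen only for illustration; they are not taken from the mariadb chart or from the StarlingX manifest.

```python
import copy


def deep_merge(defaults, overrides):
    """Return a new dict with overrides layered recursively over defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical chart defaults (illustrative values only).
chart_defaults = {
    "probes": {
        "readiness": {"enabled": True, "periodSeconds": 30, "timeoutSeconds": 15},
        "startup": {"enabled": True, "failureThreshold": 30},
    }
}

# Hypothetical override for a two-controller deployment.
two_controller_overrides = {
    "probes": {"readiness": {"enabled": False}}
}

merged = deep_merge(chart_defaults, two_controller_overrides)
print(merged["probes"]["readiness"])
print(merged["probes"]["startup"])
```

Only the overridden key changes; the sibling settings and the untouched startup section keep their default values, which is why exposing the parameters is enough to adjust behavior per deployment.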

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

Yong, This LP is marked as gating for stx.3.0. Please cherry-pick the code changes to the stx.3.0 branch if applicable or add a note explaining why it shouldn't be cherry-picked.

Bill Zvonar (billzvonar) wrote :

Yong - reminder: This LP is marked as gating for stx.3.0. Please cherry-pick the code changes to the stx.3.0 branch if applicable or add a note explaining why it shouldn't be cherry-picked.

tags: added: stx.cherrypickneeded

Reviewed: https://review.opendev.org/747093
Committed: https://git.openstack.org/cgit/starlingx/openstack-armada-app/commit/?id=6afdaf7221a39fdceb15dbdd4dc8c309fc804d11
Submitter: Zuul
Branch: r/stx.3.0

commit 6afdaf7221a39fdceb15dbdd4dc8c309fc804d11
Author: Hu, Yong <email address hidden>
Date: Wed Dec 18 02:23:44 2019 +0000

    Update mariadb chart to enable probe overrides

    Adding probes parameters for armada overriding them in duplex AIO and
    multi-node deployment. Specifically, there are 2 mariadb-servers in
    the DB cluster for OpenStack services at duplex or multi-node cases.
    These 2 mariadb-server pods are placed on Controller-0 and Controller-1
    respectively (manipulated by anti-affinity). Whenever one Controller is
    rebooted on purpose or even worse accidiently shutdown for any reasons
    mariadb-server pod on that controller is gone together. To keep mariadb
    cluster still working even with only one instance, we have to adjust
    the default probe behaviors. Upon this request, we have to export probe
    parameters for "startupProbe" and "readinessProbe" so that StarlingX
    Armada application could set these parameters accordingly and thereby
    mariadb server can still work as expected with even only one pod in the
    cases of Controller node rebooting or shutdown.

    Closes-bug: 1855474

    Change-Id: I3a8a99edd44d7ac4257ddf79b6baba5c52714324
    Signed-off-by: Hu, Yong <email address hidden>
    Co-Authored-By: Zhipeng, Liu <email address hidden>

Bill Zvonar (billzvonar) on 2020-08-27
tags: removed: stx.cherrypickneeded