StarlingX

OpenStack pods were not recovered after force reboot active controller

Bug #1855474 reported by Yosief Gebremariam on 2019-12-06

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	High	yong hu

Bug Description

Brief Description
-----------------
Many OpenStack pods failed to recover, or were slow to recover, after force rebooting the active controller.

Severity
--------
Major

Steps to Reproduce
------------------
- Install and configure system, apply stx-openstack application
- 'sudo reboot -f' from active controller

Expected Behavior
------------------
- system swacts to the standby controller and all OpenStack pods recover to Running or Completed states.

Actual Behavior
----------------
- After force rebooting the controller, a number of OpenStack pods were stuck in the Init state. The keystone API and cinder-volume pods crashed.

controller-0:~$ kubectl get pods --all-namespaces
NAMESPACE   NAME                                    READY   STATUS                  RESTARTS   AGE
openstack   cinder-api-59fd9c7c6f-86h2d             0/1     Init:0/2                0          3h
openstack   cinder-volume-654bcb6569-lsjxt          0/1     Init:CrashLoopBackOff   22         3h
openstack   fm-rest-api-78f97cc864-fqkhj            0/1     Init:0/1                0          3h
openstack   glance-api-54777c6d45-gxrdc             0/1     Init:0/3                0          3h
openstack   heat-api-69b8487b88-g4tc2               0/1     Init:0/1                0          3h
openstack   heat-cfn-6b4b6b74f8-w7f78               0/1     Init:0/1                0          3h
openstack   heat-engine-8458cf778f-xbbd4            0/1     Init:0/1                0          3h
openstack   heat-engine-cleaner-1575645900-pd697    0/1     Init:0/1                0          178m
openstack   [...]-5545469f58-j4bf6                  0/1     Init:0/1                0          175m
openstack   keystone-api-6c45dc9dbb-2v8h5           0/1     CrashLoopBackOff        43         3h39m
openstack   keystone-api-6c45dc9dbb-pch72           0/1     Init:0/1                0          3h
openstack   neutron-server-79c6fdf585-lwpb7         0/1     Init:0/1                0          3h
openstack   nova-api-metadata-855ccf8fc4-mk446      0/1     Init:0/2                0          3h
openstack   nova-api-osapi-58b7ffbf-zjv8l           0/1     Init:0/1                0          3h
openstack   nova-conductor-6bbf89bf4c-7bhvg         0/1     Init:0/1                0          3h
openstack   nova-novncproxy-58779744bd-szx4m        0/1     Init:0/3                0          3h
openstack   nova-scheduler-67c986b5c8-rgt8x         0/1     Init:0/1                0          3h
openstack   nova-service-cleaner-1575648000-kdln5   0/1     Init:0/1                0          143m

Reproducibility
---------------
Intermittent (2 out of 3)

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
r/stx.3.0 as of 2019-12-05 02:30:00

Timestamp/Logs
--------------
[2019-12-06 15:21:50,338] 181 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-12-06 15:21:50,338] 311 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

Yosief Gebremariam (ygebrema) wrote on 2019-12-06:

ALL_NODES_20191206.173815.tar (110.6 MiB, application/x-tar)

summary:

- openstack pods were not recovered after force reboot active controller
+ OpenStack pods were not recovered after force reboot active controller

Ghada Khalil (gkhalil) wrote on 2019-12-06:

Assigning to the distro.openstack PL for triage and release recommendation -- keystone & cinder pods are not recovering.

description:	updated
tags:	added: stx.containers stx.distro.openstack
Changed in starlingx:
assignee:	nobody → yong hu (yhu6)

Yosief Gebremariam (ygebrema) on 2019-12-06

description:	updated

yong hu (yhu6) wrote on 2019-12-09:

@zhipeng, please analyze what went wrong after the active controller was rebooted and swacted.

Changed in starlingx:
assignee:	yong hu (yhu6) → zhipeng liu (zhipengs)

yong hu (yhu6) on 2019-12-09

tags:	added: stx.3.0
Changed in starlingx:
importance:	Undecided → Medium

yong hu (yhu6) wrote on 2019-12-09:

I reproduced the issue in my environment: "mariadb-server" and "mariadb-ingress" were not running after rebooting the active controller and switching to the other controller. Because the mariadb services are fundamental, their failure was blocking the other OpenStack services.

Debugging this issue for the cause...

Peng Peng (ppeng) wrote on 2019-12-10:

ALL_NODES_20191210.144348.tar (51.9 MiB, application/x-tar)

Issue was reproduced on
Lab: WCP_76_77
Load: 20191210T000000Z

[2019-12-10 11:14:27,854] 311 DEBUG MainThread ssh.send :: Send 'kubectl get pod --field-selector=status.phase=Running -o=wide --all-namespaces | grep --color=never -v 1/1'
[2019-12-10 11:14:28,286] 433 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack horizon-778555f47b-srlcm 0/1 Running 0 2m3s 172.16.192.78 controller-0 <none> <none>
openstack keystone-api-787744798f-xc7gl 0/1 Running 5 151m 172.16.166.141 controller-1 <none> <none>

[2019-12-10 12:28:00,655] 311 DEBUG MainThread ssh.send :: Send 'kubectl get pod --field-selector=status.phase=Running -o=wide --all-namespaces | grep --color=never -v 1/1'
[2019-12-10 12:28:01,167] 433 DEBUG MainThread ssh.expect :: Output:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack keystone-api-787744798f-xc7gl 0/1 CrashLoopBackOff 19 3h44m 172.16.166.141 controller-1 <none> <none>

Elio Martinez (elio1979) wrote on 2019-12-10:

I'm trying to reproduce the issue with the following version, but I have not been able to:

controller-1:~$ cat /etc/build.info
###
### StarlingX
### Release 19.12
###

OS="centos"
SW_VERSION="19.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="r/stx.3.0"

JOB="STX_BUILD_3.0"
<email address hidden>"
BUILD_NUMBER="10"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-12-05 02:30:00 +0000"

Cristopher Lemus (cjlemusc) wrote on 2019-12-11:

ALL_NODES_20191210.184122.tar (85.0 MiB, application/x-tar)

Reproduced on Standard (2+2) Baremetal with load from Dec 10th (BUILD_DATE="2019-12-10 00:00:00 +0000"). Full log attached.

Pods did not recover after forcing a reboot of the active controller.

http://paste.openstack.org/show/787408/

yong hu (yhu6) on 2019-12-11

Changed in starlingx:
assignee:	zhipeng liu (zhipengs) → yong hu (yhu6)

Ghada Khalil (gkhalil) wrote on 2019-12-11:

As per community review on 12/11, the recommendation is to cherry-pick this fix to r/stx.3.0 for the first maintenance release. Therefore, changing the priority to High (only high priority bugs are cherrypicked after the final compile).

Changed in starlingx:
importance:	Medium → High
status:	New → Triaged

yong hu (yhu6) wrote on 2019-12-16:

The root cause of why the mariadb-server cluster failed to recover after forcibly rebooting the active controller has been found.

There are 2 mariadb-server instances in the mariadb cluster and, as defined by the "ReadinessProbe", they periodically cross-check each other's status. In the failing case, whenever a StarlingX controller running one mariadb-server is rebooted, the other mariadb-server (on the other controller) fails to sync with the one destroyed by the reboot, and the resulting "ReadinessProbe" failure in turn causes its own pod to fail. This is essentially a deadlock, and eventually neither of the 2 mariadb-servers comes back to life.

This issue does not come from upstream openstack-helm/mariadb, which actually has 3 instances. With 3 mariadb-server instances, the failure of one instance does not crash the cluster, because the other 2 live instances can still "cross-check" each other.
In StarlingX, we override the replica count from 3 to 2 so that the 2 mariadb servers can be placed on the 2 controllers respectively. However, having only 2 instances introduces the deadlock described above.

The solution is to disable the "ReadinessProbe", following the practice we adopted for Nova.
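
The deadlock can be illustrated with a toy model. The sketch below is not the chart's actual probe script; it assumes, purely for illustration, that a replica counts as ready only when its node is up and at least one other replica was ready in the previous probe round. The function name `simulate` and all of its parameters are hypothetical.

```python
from __future__ import annotations


def simulate(replicas: int, rounds: int = 4, down_rounds: int = 1,
             peer_check: bool = True) -> list[bool]:
    """Toy model of readiness after replica 0's node is force rebooted.

    Assumption (not the real probe script): with peer_check enabled, a
    replica is ready only if its node is up AND some other replica was
    ready in the previous probe round.
    """
    ready = [True] * replicas
    ready[0] = False  # the pod on the rebooted controller is gone
    for rnd in range(rounds):
        nxt = []
        for i in range(replicas):
            # Replica 0's node is unavailable for the first down_rounds rounds.
            node_up = not (i == 0 and rnd < down_rounds)
            peer_ready = any(ready[j] for j in range(replicas) if j != i)
            nxt.append(node_up and (peer_ready or not peer_check))
        ready = nxt
    return ready


if __name__ == "__main__":
    print(simulate(2))                    # [False, False]
    print(simulate(3))                    # [True, True, True]
    print(simulate(2, peer_check=False))  # [True, True]
```

Under these assumptions, 2 replicas never recover once one is lost, 3 replicas ride through the reboot, and 2 replicas recover when the peer-dependent check is switched off, which mirrors the reasoning behind disabling the probe.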

Changed in starlingx:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2019-12-18: Fix proposed to openstack-armada-app (master)

#10

Fix proposed to branch: master
Review: https://review.opendev.org/699532

OpenStack Infra (hudson-openstack) wrote on 2019-12-18: Fix proposed to config (master)

#11

Fix proposed to branch: master
Review: https://review.opendev.org/699537

Peng Peng (ppeng) wrote on 2019-12-24:

#12

ALL_NODES_20191224.145835.tar (203.9 MiB, application/x-tar)

Issue reproduced on
Lab: PV1
Load: 20191224T000000Z

[2019-12-24 09:46:15,497] 181  INFO  MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-12-24 09:46:15,497] 311  DEBUG MainThread ssh.send    :: Send 'sudo reboot -f'

[2019-12-24 09:59:52,052] 311  DEBUG MainThread ssh.send    :: Send 'kubectl get pod --field-selector=status.phase!=Running,status.phase!=Succeeded --all-namespaces -o=wide'
[2019-12-24 09:59:52,328] 433  DEBUG MainThread ssh.expect  :: Output: 
NAMESPACE   NAME                                   READY   STATUS     RESTARTS   AGE     IP               NODE           NOMINATED NODE   READINESS GATES
openstack   cinder-api-7db9fcf6b6-fcnv2            0/1     Init:0/2   0          12m     172.16.192.117   controller-0   <none>           <none>
openstack   cinder-volume-777f8744b-49hdl          0/1     Init:3/4   1          12m     172.16.192.105   controller-0   <none>           <none>
openstack   fm-rest-api-84b8579d67-vlp79           0/1     Init:0/1   0          12m     172.16.192.95    controller-0   <none>           <none>
openstack   glance-api-56677499-qcbr6              0/1     Init:0/3   0          12m     172.16.192.107   controller-0   <none>           <none>
openstack   heat-api-67d68b5478-s8znp              0/1     Init:0/1   0          12m     172.16.192.110   controller-0   <none>           <none>
openstack   heat-cfn-676dc96c56-ztgrw              0/1     Init:0/1   0          12m     172.16.192.75    controller-0   <none>           <none>
openstack   heat-engine-6695466669-mlsqh           0/1     Init:0/1   0          12m     172.16.192.109   controller-0   <none>           <none>
openstack   heat-engine-cleaner-1577181000-d2ls4   0/1     Init:0/1   0          9m47s   172.16.166.183   controller-1   <none>           <none>
openstack   horizon-5b8c4fb977-fbq5w               0/1     Init:0/1   0          7m47s   172.16.192.74    controller-0   <none>           <none>
openstack   keystone-api-7bd7cb98d8-7l7sl          0/1     Init:0/1   0          12m     172.16.192.111   controller-0   <none>           <none>
openstack   neutron-server-7fb5cb6fb5-q7lfq        0/1     Init:0/1   0          12m     172.16.192.99    controller-0   <none>           <none>
openstack   nova-api-metadata-f7797f95c-8vfln      0/1     Init:0/2   0          12m     172.16.192.89    controller-0   <none>           <none>
openstack   nova-api-osapi-5c4fb5b84c-x5cds        0/1     Init:0/1   0          12m     172.16.192.123   controller-0   <none>           <none>
openstack   nova-conductor-5847bc6cc9-dfjfm        0/1     Init:0/1   0          12m     172.16.192.119   controller-0   <none>           <none>
openstack   nova-novncproxy-5f665c7d9d-m22t5       0/1     Init:0/3   0          12m     172.16.192.126   controller-0   <none>           <none>
openstack   nova-scheduler-674d9f7d94-kv7s9        0/1     Init:0/1   0          12m     172.16.192.108   controller-0   <none>           <none>
controller-1:~$
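
For triaging captures like the one above, a small helper can list the pods that are not fully ready. This is only a sketch: it assumes the column order NAMESPACE, NAME, READY, STATUS, RESTARTS shown in the output, the function name `unready_pods` is hypothetical, and the `example-ready-pod` row in the sample is made up to show a healthy pod being skipped.

```python
from __future__ import annotations


def unready_pods(kubectl_output: str) -> list[tuple[str, str, str]]:
    """Return (name, ready, status) for pods that are not fully ready.

    Assumes `kubectl get pods --all-namespaces` style columns:
    NAMESPACE NAME READY STATUS RESTARTS ...
    """
    pods = []
    in_table = False
    for line in kubectl_output.splitlines():
        fields = line.split()
        if fields[:2] == ["NAMESPACE", "NAME"]:
            in_table = True  # header found; pod rows follow
            continue
        if not in_table or len(fields) < 5:
            continue  # log lines before the table, prompts, blank lines
        name, ready, status = fields[1], fields[2], fields[3]
        have, _, want = ready.partition("/")
        if not (have.isdigit() and want.isdigit()):
            continue  # not a pod row (for example an interleaved log line)
        if status != "Completed" and have != want:
            pods.append((name, ready, status))
    return pods


SAMPLE = """\
NAMESPACE   NAME                              READY   STATUS     RESTARTS   AGE
openstack   cinder-api-7db9fcf6b6-fcnv2       0/1     Init:0/2   0          12m
openstack   cinder-volume-777f8744b-49hdl     0/1     Init:3/4   1          12m
openstack   example-ready-pod                 1/1     Running    0          12m
controller-1:~$
"""

if __name__ == "__main__":
    for name, ready, status in unready_pods(SAMPLE):
        print(name, ready, status)
```

Comparing the READY counts rather than only the STATUS column also catches pods that are Running but not ready, such as the keystone-api pod reported earlier in this thread.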

Peng Peng (ppeng) wrote on 2020-02-05:

#13

Issue was reproduced on
Lab: PV1
Load: 20200127T000002Z

Log @
https://files.starlingx.kube.cengn.ca/launchpad/1855474

zhipeng liu (zhipengs) wrote on 2020-02-12:

#14

Hi pengpeng,

The root cause has already been found and the fix is ready.
We need to get related patches merged before retest it again.

Thanks!
Zhipeng

Ghada Khalil (gkhalil) wrote on 2020-02-19:

#15

Hi Zhipeng,
This issue is seen regularly. Can you please work on addressing the comments in the review from Feb 11 so that this can be merged? Thanks.

OpenStack Infra (hudson-openstack) wrote on 2020-02-27: Change abandoned on config (master)

#16

Change abandoned by yong hu (<email address hidden>) on branch: master
Review: https://review.opendev.org/699537
Reason: drop this one and will make the changes in the manifest, according to Angie's comment.

Ghada Khalil (gkhalil) wrote on 2020-02-28:

#17

Also adding the stx.4.0 tag as this is regularly seen with stx master

tags:

added: stx.4.0

Yatindra (yatindra) wrote on 2020-03-26:

#18

Hi,
I think I am also affected by this issue. I am using the stx 3.0 release; some time ago I rebooted the active controller, and stx-openstack now fails to apply.

This is the output of $ kubectl describe pod mariadb-server-0 -n openstack
--
Events:
  Type     Reason     Age                      From                   Message
  ----     ------     ----                     ----                   -------
  Warning  Unhealthy  2m7s (x8251 over 2d20h)  kubelet, controller-1  Readiness probe failed:

Is it possible to get a patch for this in the stx 3.0 release, or will the fix only be available in stx 4.0?

Yatindra (yatindra) wrote on 2020-03-26:

#19

stx-openstack-apply_2020-03-25-16-44-41.log (52.8 KiB, text/plain)

Sysinv logs for more info

yong hu (yhu6) wrote on 2020-03-26:

#20

@zhipeng has a patch under review.
@Yatindra, just to double-check, please check the status of the mariadb pod, which should be the root cause we identified.

Yatindra (yatindra) wrote on 2020-03-26:

#21

@yong

I shared the output for the mariadb-server-0 pod above, which shows a readiness probe failure. That appears to be the root cause: as the events show, mariadb-server-0 is not ready.

Events:
  Type     Reason     Age                      From                   Message
  ----     ------     ----                     ----                   -------
  Warning  Unhealthy  2m7s (x8251 over 2d20h)  kubelet, controller-1  Readiness probe failed:

yong hu (yhu6) wrote on 2020-03-26:

#22

Yes, it was the root cause.
See the commit message in this patch: https://review.opendev.org/#/c/699532/

OpenStack Infra (hudson-openstack) wrote on 2020-04-21: Fix merged to openstack-armada-app (master)

#23

Reviewed: https://review.opendev.org/699532
Committed: https://git.openstack.org/cgit/starlingx/openstack-armada-app/commit/?id=6538342c26c479988fbf8515260ee76e2809bc08
Submitter: Zuul
Branch: master

commit 6538342c26c479988fbf8515260ee76e2809bc08
Author: Hu, Yong <email address hidden>
Date: Wed Dec 18 02:23:44 2019 +0000

Update mariadb chart to enable probe overrides

    Adding probes parameters for armada overriding them in duplex AIO and
    multi-node deployment. Specifically, there are 2 mariadb-servers in
    the DB cluster for OpenStack services at duplex or multi-node cases.
    These 2 mariadb-server pods are placed on Controller-0 and Controller-1
    respectively (manipulated by anti-affinity). Whenever one Controller is
    rebooted on purpose or even worse accidentally shutdown for any reasons
    mariadb-server pod on that controller is gone together. To keep mariadb
    cluster still working even with only one instance, we have to adjust
    the default probe behaviors. Upon this request, we have to export probe
    parameters for "startupProbe" and "readinessProbe" so that StarlingX
    Armada application could set these parameters accordingly and thereby
    mariadb server can still work as expected with even only one pod in the
    cases of Controller node rebooting or shutdown.

Closes-bug: 1855474

    Change-Id: I3a8a99edd44d7ac4257ddf79b6baba5c52714324
    Signed-off-by: Hu, Yong <email address hidden>
    Co-Authored-By: Zhipeng, Liu <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

Ghada Khalil (gkhalil) wrote on 2020-04-27:

#24

Yong, This LP is marked as gating for stx.3.0. Please cherry-pick the code changes to the stx.3.0 branch if applicable or add a note explaining why it shouldn't be cherry-picked.

Bill Zvonar (billzvonar) wrote on 2020-08-13:

#25

Yong - reminder: This LP is marked as gating for stx.3.0. Please cherry-pick the code changes to the stx.3.0 branch if applicable or add a note explaining why it shouldn't be cherry-picked.

tags:

added: stx.cherrypickneeded

OpenStack Infra (hudson-openstack) wrote on 2020-08-20: Fix proposed to openstack-armada-app (r/stx.3.0)

#26

Fix proposed to branch: r/stx.3.0
Review: https://review.opendev.org/747093

OpenStack Infra (hudson-openstack) wrote on 2020-08-27: Fix merged to openstack-armada-app (r/stx.3.0)

#27

Reviewed: https://review.opendev.org/747093
Committed: https://git.openstack.org/cgit/starlingx/openstack-armada-app/commit/?id=6afdaf7221a39fdceb15dbdd4dc8c309fc804d11
Submitter: Zuul
Branch: r/stx.3.0

commit 6afdaf7221a39fdceb15dbdd4dc8c309fc804d11
Author: Hu, Yong <email address hidden>
Date: Wed Dec 18 02:23:44 2019 +0000

Update mariadb chart to enable probe overrides

Closes-bug: 1855474

    Change-Id: I3a8a99edd44d7ac4257ddf79b6baba5c52714324
    Signed-off-by: Hu, Yong <email address hidden>
    Co-Authored-By: Zhipeng, Liu <email address hidden>

Bill Zvonar (billzvonar) on 2020-08-27

tags:

removed: stx.cherrypickneeded
